This post is part of a series on programming and Big Data concepts such as Apache Hadoop, Apache Spark, and Apache Flink. The series of posts on Apache Spark is given here.

Apache Spark

Apache Spark is a framework, or perhaps more broadly a paradigm, for processing data in a distributed environment. Its architecture is typically described in terms of three major components: the storage layer, the cluster manager, and Spark itself. The storage layer can be anything like HDFS, S3, DBFS, and so on.
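To make that concrete, here is a minimal sketch of how an application meets these components. The app name and master URL below are placeholders: `local[*]` runs Spark in-process, while a real deployment would point at a cluster manager such as YARN or a standalone master.

```scala
import org.apache.spark.sql.SparkSession

object ArchitectureSketch {
  def main(args: Array[String]): Unit = {
    // The app name and master URL are placeholders for illustration.
    val spark = SparkSession.builder()
      .appName("architecture-sketch")
      // "local[*]" runs Spark in-process; on a real cluster this would be
      // a cluster-manager URL such as "yarn" or "spark://host:7077".
      .master("local[*]")
      .getOrCreate()

    // Work submitted through this session is scheduled by the cluster manager.
    println(spark.version)
    spark.stop()
  }
}
```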

As discussed, storage can be HDFS (Hadoop Distributed File System), S3, DBFS (Databricks File System), and so on. The cluster manager negotiates cluster resources on behalf of applications, providing compute resources to submitted Spark jobs. Spark itself is a processing engine composed of several components that evolved over time: Spark Core, Spark SQL, Spark Streaming, GraphX, and Spark MLlib.
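Because the storage layer is pluggable, the same read API works across backends. A minimal sketch follows; the paths and bucket names are hypothetical, and reading from S3 additionally requires the hadoop-aws connector on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("storage-sketch").getOrCreate()

// Spark resolves the URI scheme to the matching storage layer.
val fromHdfs = spark.read.textFile("hdfs:///data/events.log")  // HDFS
val fromS3   = spark.read.parquet("s3a://my-bucket/events/")   // S3 (needs hadoop-aws)
val fromDbfs = spark.read.csv("dbfs:/mnt/events/")             // DBFS on Databricks
```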

Spark Core is the main execution engine on which all the other modules, such as Spark SQL, Spark Streaming, GraphX, and Spark MLlib, are built. Spark Core provides APIs in Scala, Java, and Python. Spark's main data abstraction is the RDD (Resilient Distributed Dataset), its fundamental datatype. Everything works around RDDs: Spark SQL DataFrame and Dataset operations are translated into RDD-based execution by the Catalyst optimizer and scheduled as a DAG of stages.
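A short sketch of the RDD API (word-count style, with made-up input) shows the core pattern of lazy transformations followed by an action:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Build an RDD from a local collection (the lines here are made up).
val lines = sc.parallelize(Seq("spark core", "spark sql", "spark streaming"))

val counts = lines
  .flatMap(_.split("\\s+"))   // transformation: split lines into words
  .map(word => (word, 1))     // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)         // transformation: sum the counts per word

counts.collect().foreach(println)  // action: triggers the distributed computation
```

Nothing executes until the action (`collect`) is invoked; the transformations only build the lineage that the DAG scheduler later runs as stages.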

Spark SQL provides a table-like structure over data called DataFrames, along with a typed version called Datasets.
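As a closing sketch (written for spark-shell, where `spark` is predefined; the `User` case class and the rows are made up), the same data can be handled as an untyped DataFrame or as a typed Dataset:

```scala
import spark.implicits._

// Hypothetical record type for the typed view.
case class User(name: String, age: Int)

// DataFrame: rows with a schema, column references checked at runtime.
val df = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")
df.filter($"age" > 30).show()

// Dataset: the same rows bound to the case class, checked at compile time.
val ds = df.as[User]
ds.filter(_.age > 30).show()
```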