Apache Spark is an open-source cluster computing framework that has reshaped how big data is processed. Spark is reported to run workloads up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk.
This post covers the Spark architecture and its core components. Apache Spark is built on three main building blocks: data storage, API, and resource management.
Core Components
For data storage, Spark uses the HDFS file system and can also handle data from HBase or Cassandra. The Spark API provides interfaces for developing applications in Java, Python, and Scala. Resource management can be handled either by running Spark as a standalone server or in a distributed manner using a framework such as Mesos or YARN.
Resilient Distributed Dataset
Apache Spark is built on the concept of the RDD (Resilient Distributed Dataset). Loosely comparable to a table in a database, an RDD is an immutable, distributed collection of records that can hold data in various formats. Spark splits an RDD into partitions and processes them in parallel across the cluster, and an RDD recovers from faults efficiently by recomputing lost partitions after a failure.
Calling a transformation on an RDD returns a new RDD while the original remains unmodified, since RDDs are immutable.
An RDD supports two kinds of parallel operations: transformations and actions. A transformation always returns a new RDD rather than a value, and no evaluation happens when it is called. Examples of transformations are map, filter, flatMap, groupByKey, reduceByKey, sample, and union.
An action evaluates the transformations recorded on the RDD it is called on, executes the query, and returns a value as the result. Examples of actions are count, reduce, collect, and take.
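The difference between transformations and actions can be sketched as follows. This is a minimal illustration rather than part of the original walkthrough: the application name, the local[2] master, and the sample numbers are assumptions chosen for the example, and SparkContext.getOrCreate is used so that the snippet reuses the shell's existing context when pasted into the Spark shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed settings for illustration: a hypothetical app name and a local master.
val conf = new SparkConf().setAppName("rdd-basics-sketch").setMaster("local[2]")
val ctx = SparkContext.getOrCreate(conf) // reuses the shell's context if one exists

// Sample data made up for this example.
val numbers = ctx.parallelize(Seq(1, 2, 3, 4, 5, 6))

// Transformations: each call returns a new RDD and nothing is computed yet.
val evens = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: these trigger evaluation and return plain values to the driver.
val howMany = doubled.count()       // Long
val total = doubled.reduce(_ + _)   // Int
val asArray = doubled.collect()     // Array[Int]

// The source RDD is untouched by the transformations above.
val originalCount = numbers.count()

println(s"count=$howMany total=$total values=${asArray.mkString(",")} original=$originalCount")
```

Until count, reduce, or collect is called, Spark only records the chain of filter and map steps. That recorded chain is also what allows lost partitions to be recomputed after a failure.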
Getting Hands Dirty with Spark
Let us now install Apache Spark and run a simple word count application. You can either use Spark distributions available from vendors such as Cloudera, MapR, or Hortonworks, or use Spark in the cloud. To run on your local machine, Spark needs Java installed, so we will first set up the Java Development Kit (JDK). The steps described below are for a machine running Windows. The steps for Linux or macOS are similar, but the way environment variables are set may differ.
Installing Java
Installing the JDK is straightforward. Download the JDK (version 1.7 recommended) from the official vendor (Oracle) website and run the installer. Once installation is complete, verify it by running the command below on the command line. If the installation succeeded, it prints the Java version.
java -version
Now let us install Spark.
Installing Spark
To install Spark, go to the official Apache Spark website, download the latest version, unzip the file if necessary, and move it to a convenient location. Change to that folder and launch the Spark shell. An example is shown below, assuming Spark has been extracted to C:\spark_setup.
C:
cd C:\spark_setup\spark-1.4.0-bin-hadoop2.1
bin\spark-shell
If you see the prompt below, voilà! You have installed Spark successfully.
15/07/21 13:15:32 INFO HttpServer: Starting HTTP Server
15/07/21 13:15:32 INFO Utils: Successfully started service 'HTTP class server' on port 58132.
Welcome to
Using Scala version 2.10.4 (Java HotSpot)
Type in expressions to have them evaluated.
Type :help for more information.
15/07/21 13:15:41 INFO BlockManagerMaster: Registered BlockManager
15/07/21 13:15:41 INFO SparkILoop: Created spark context..
Spark context available as sc.
To verify that the Spark shell works properly, try the commands below.
sc.version
sc.appName
To quit the shell, use the command below.
:quit
Windows does not come with a Python interpreter, so to run the Spark Python shell we need to set up Python in our environment. The official Python website provides an installer for Windows, or we can use a distribution such as Anaconda, which comes with an additional collection of computational tools written in Python.
Once Python has been installed, you can launch the Spark Python shell by executing pyspark from the Spark installation directory. An example is given below:
C:
cd C:\spark_setup\spark-1.4.0-bin-hadoop2.1
bin\pyspark
That is all we need to run the Apache Spark interactive shell in Scala or Python. Spark also comes with a web console. Let us see how to use it.
Using the Apache Spark Web Console
While working with Spark, you can view analysis results and other information in the web console. By default, a locally running application serves it at http://localhost:4040.
Master URL for Different Modes
A connection to the Spark engine can be made in different modes. Whether Spark runs locally or in the cloud, the mode is selected by configuring the master URL parameter as shown below.
Setting the master URL parameter to:
mesos://zk: host_name:port_number
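As a hedged illustration of where a master URL goes in application code, the sketch below sets it through SparkConf.setMaster. The application name and the choice of local[2] (run in-process with two worker threads) are assumptions for the example. Other commonly used forms include local, local[*], spark://host:port for a standalone cluster, and mesos://host:port for Mesos.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application name; "local[2]" is an assumed choice that runs
// Spark inside this JVM with two worker threads.
val conf = new SparkConf()
  .setAppName("master-url-sketch")
  .setMaster("local[2]")

// The master URL is stored under the "spark.master" configuration key.
val configuredMaster = conf.get("spark.master")
println(s"configured master: $configuredMaster")

// getOrCreate returns an already running context if there is one (as in the
// Spark shell), in which case that context keeps its own master setting.
val ctx = SparkContext.getOrCreate(conf)
println(s"active master: ${ctx.master}")
```

When an application is launched with spark-submit, the same value can instead be supplied with the --master option, which keeps the master URL out of the code.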
Shared Variables
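Apache Spark provides two types of shared variables, broadcast variables and accumulators, that can speed up applications running on a cluster. A broadcast variable keeps a read-only value cached on each machine, while an accumulator is a variable that tasks can only add to, such as a counter or a sum. Below is an example of using a broadcast variable in the Scala prompt.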
// Initializing a broadcast variable
val broadCastElement = sc.broadcast(Array("Nirman", "Shan", "Srini"))
// Reading the broadcast value
broadCastElement.value
Below is an example of using an accumulator in the Scala prompt.
// Usage of an accumulator variable
val accumulatorVar = sc.accumulator(0, "Example Accumulator Variable")
// Add 1 per record; an Int accumulator cannot accumulate the strings themselves
sc.parallelize(Array("Nirman", "Shan", "Srini")).foreach(_ => accumulatorVar += 1)
accumulatorVar.value
With the tools in hand, let us now collect the pieces together and build our simple word count application.
Word Count Application with Apache Spark
Using the Spark API, data can be easily read from text files and processed. The example below, run in the Scala shell, shows how.
To run the conventional word count application in a Scala shell, run the commands below.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val textFileRead = "sample_data.md"
val textFileData = sc.textFile(textFileRead)
textFileData.cache()
Calling cache() stores the RDD in memory so that it can be read quickly in later queries. cache() is lazily evaluated: the data is not stored immediately, but only when an action is first called on the RDD.
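The lazy behaviour of cache() can be sketched with a small in-memory RDD. This is an illustrative sketch only: the application name, the local[2] master, and the generated numbers are assumptions and do not come from the sample file used in this walkthrough.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumed settings for illustration only.
val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[2]")
val ctx = SparkContext.getOrCreate(conf) // reuses the shell's context if one exists

// Made-up data: the squares of 1 to 1000, as Long to avoid overflow when summed.
val squares = ctx.parallelize(1 to 1000).map(n => n.toLong * n)

// cache() only marks the RDD for in-memory storage; no job runs here.
squares.cache()
val level = squares.getStorageLevel // MEMORY_ONLY, the level cache() requests

// The first action computes the partitions and keeps them in memory.
val total = squares.reduce(_ + _)

// Later actions on the same RDD can read the stored partitions
// instead of recomputing the map step.
val count = squares.count()

println(s"level=${level.description} total=$total count=$count")
```

If the cached data is no longer needed, calling unpersist() on the RDD releases the memory.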
To count the number of lines in the text file, run the command below.
textFileData.count()
Now we can print the count next to each word in the file, as below.
val wordCountData = textFileData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCountData.collect().foreach(println)
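A common next step is to list only the most frequent words. The sketch below is one possible way to do that, not part of the original walkthrough. It uses two made-up in-memory lines in place of a file, and it splits on runs of whitespace and lower-cases words, which are choices made for this example. The application name and local[2] master are likewise assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed settings for illustration only.
val conf = new SparkConf().setAppName("top-words-sketch").setMaster("local[2]")
val ctx = SparkContext.getOrCreate(conf) // reuses the shell's context if one exists

// Hypothetical input standing in for the lines of a text file.
val lines = ctx.parallelize(Seq("to be or not to be", "to do is to be"))

val counts = lines
  .flatMap(_.split("\\s+"))            // split each line into words
  .filter(_.nonEmpty)                  // drop empty tokens
  .map(word => (word.toLowerCase, 1))  // pair each word with a count of 1
  .reduceByKey(_ + _)                  // sum the counts per word

// Sort by count, highest first, and bring only the top two to the driver.
val topTwo = counts.sortBy(_._2, ascending = false).take(2)

topTwo.foreach { case (word, n) => println(s"$word: $n") }
```

Using take(n) after sorting returns only a few records to the driver, which matters when the full result is too large to collect().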
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.