What is Apache Spark?

Overview: Apache Spark is a high-performance, general-purpose engine used to process large-scale data. It is an open source framework for cluster computing. The aim of this framework is to make data analytics faster, both in terms of development and execution. In this document, I will talk about Apache Spark and discuss various aspects of this framework.

Introduction: Apache Spark is an open source framework for cluster computing. It is built on top of the Hadoop Distributed File System (HDFS), but it does not use the two-stage MapReduce paradigm, and at the same time it promises up to 100 times faster performance for certain applications. Spark also provides the primitives for in-memory cluster computing, which enables application programs to load data into the memory of a cluster so that it can be queried repeatedly. This in-memory computation makes Spark one of the most important components in the big data computation world.

Features: Now let us discuss the features in brief. Apache Spark provides the following features:

  • APIs in Java, Scala and Python.
  • Scalability up to a range of 80 to 100 nodes.
  • The ability to cache datasets in memory for interactive use, e.g. extract a working set, cache it and query it repeatedly.
  • An efficient library for stream processing.
  • An efficient library for machine learning and graph processing.

When talking about Spark in the context of data science, it is worth noting that Spark has the ability to keep the resident data in memory. This approach enhances performance compared to MapReduce. Looking at it from the top, a Spark application contains a driver program which runs the user's main method and executes various operations in parallel on a cluster.

Spark provides the resilient distributed dataset (RDD), which is a collection of elements distributed across the different nodes of a cluster so that they can be operated on in parallel. Spark has the ability to store an RDD in memory, so that it can be reused efficiently across parallel operations. RDDs can also recover automatically in case of node failures.

Spark also provides shared variables which are used in parallel operations. When Spark runs a set of tasks in parallel on different nodes, it transfers a copy of each variable to each task. These variables are also shared across different tasks. In Spark we have two types of shared variables, both illustrated in the sketch after this list –

  • Broadcast variables – used to cache a value in memory on all nodes.
  • Accumulators – used for counters and sums.
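
To make this concrete, here is a minimal sketch using the Java API; the application name, master URL and sample values are illustrative assumptions, and the usual imports (org.apache.spark.api.java.JavaSparkContext, org.apache.spark.api.java.JavaRDD, org.apache.spark.broadcast.Broadcast, org.apache.spark.Accumulator, java.util.Arrays) are assumed:

SparkConf conf = new SparkConf().setAppName("SharedVariablesDemo").setMaster("local[2]"); // illustrative values
JavaSparkContext sc = new JavaSparkContext(conf);
// Broadcast variable: a read-only value cached in memory on every node.
Broadcast<List<Integer>> lookup = sc.broadcast(Arrays.asList(1, 2, 3));
// Accumulator: tasks may only add to it; the driver reads the final value.
Accumulator<Integer> counter = sc.accumulator(0);
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(10, 20, 30, 40));
numbers.foreach(n -> {
    counter.add(1);                                        // count processed elements
    System.out.println(n + " / lookup size " + lookup.value().size());
});
System.out.println("Elements processed: " + counter.value());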

Configuring Spark:

Spark provides three main areas for configuration:

  • Spark properties – These control most of the application settings and can be set either by using a SparkConf object or through Java system properties.
  • Environment variables – These can be used to configure machine-based settings, e.g. the IP address, via the conf/spark-env.sh script on each node.
  • Logging – This can be configured using the standard Log4j properties.

Spark Properties: Spark properties control most of the application settings and need to be configured separately for each application. These properties can be set using a SparkConf object and passed to the SparkContext. SparkConf allows us to configure most of the common properties at initialization. Using the set() method of the SparkConf class we can also set key-value pairs. A sample code using the set() method is shown below –

Listing 1: Sample showing the set() method

val conf = new SparkConf()
  .setMaster("AWS")
  .setAppName("My Sample SPARK application")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

Some of the common properties are listed below; a short configuration sketch follows the list –

  • spark.executor.memory – The amount of memory to be used per executor.
  • spark.serializer – The class used to serialize objects that are sent over the network. Since the default Java serialization is quite slow, it is recommended to use the org.apache.spark.serializer.KryoSerializer class to get better performance.
  • spark.kryo.registrator – The class used to register custom classes when Kryo serialization is used.
  • spark.local.dir – The locations which Spark uses as scratch space to store the map output files.
  • spark.cores.max – Used in standalone mode to specify the maximum number of CPU cores to request.
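
The following sketch sets some of these properties on a SparkConf before creating the context; the application name, memory size, directory and core count are placeholder values for illustration, not recommendations:

SparkConf conf = new SparkConf()
    .setAppName("PropertiesDemo")                                            // illustrative name
    .setMaster("local[2]")
    .set("spark.executor.memory", "2g")                                      // memory per executor
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // faster than default Java serialization
    .set("spark.local.dir", "/tmp/spark-scratch")                            // scratch space for map output files
    .set("spark.cores.max", "4");                                            // max cores in standalone mode
JavaSparkContext sc = new JavaSparkContext(conf);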

Environment Variables: Some of the Spark settings can be configured using environment variables, which are defined in the conf/spark-env.sh script file. These are machine-specific settings, e.g. the library search path, the Java path etc. Some of the commonly used environment variables are –

  • JAVA_HOME – The location where Java is installed on the system.
  • PYSPARK_PYTHON – The Python executable to use for PySpark.
  • SPARK_LOCAL_IP – The IP address of the machine to bind to.
  • SPARK_CLASSPATH – Used to add libraries that are needed on the classpath at runtime.
  • SPARK_JAVA_OPTS – Used to add JVM options.

Logging: Spark uses the standard Log4j API for logging, which can be configured using the log4j.properties file.
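
Besides editing log4j.properties, the log level can also be adjusted programmatically in the driver through the Log4j classes org.apache.log4j.Logger and org.apache.log4j.Level; this is only a convenience sketch, and the properties file remains the standard approach:

// Reduce Spark's console chatter before creating the SparkContext.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.ERROR); // relevant for older Spark versions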

Initializing Spark:

To start with a Spark program, the first thing to do is to create a JavaSparkContext object, which tells Spark how to access the cluster. To create a Spark context we first create a SparkConf object as shown below:

Listing 2: Initializing the Spark context object

SparkConf config = new SparkConf().setAppName(applicationName).setMaster(master);
JavaSparkContext context = new JavaSparkContext(config);

The parameter applicationName is the name of our application, which is shown on the cluster UI. The parameter master is a cluster URL, or a special "local" string used to run in local mode.
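
For example, during local development the two parameters might be filled in as follows (both values are illustrative):

SparkConf config = new SparkConf()
    .setAppName("MySampleApplication")   // shown on the cluster UI
    .setMaster("local[2]");              // local mode with two threads; on a cluster this would be e.g. spark://host:7077
JavaSparkContext context = new JavaSparkContext(config);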

Resilient Distributed Datasets (RDDs):

Spark is based on the concept of the resilient distributed dataset, or RDD. An RDD is a fault-tolerant collection of elements which can be operated on in parallel. An RDD can be created in either of the following two ways:

  • By parallelizing an existing collection – Parallelized collections are created by calling the parallelize method of the JavaSparkContext class in the driver program. The elements of the collection are copied to form a dataset which can be operated on in parallel.
  • By referencing a dataset on an external storage system – Spark has the ability to create distributed datasets from any Hadoop-supported storage source, e.g. HDFS, Cassandra, HBase etc. Both approaches are sketched after this list.
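
The sketch below assumes the JavaSparkContext named context from Listing 2; the collection values and the HDFS path are made-up examples:

// 1. Parallelize an existing collection in the driver program.
JavaRDD<Integer> fromCollection = context.parallelize(Arrays.asList(1, 2, 3, 4, 5));
// 2. Reference a dataset on external storage (here a hypothetical HDFS file).
JavaRDD<String> fromFile = context.textFile("hdfs://namenode:9000/data/input.txt");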

RDD Operations:

RDD supports two types of operations –

  • Transformations – Used to create a new dataset from an existing one.
  • Actions – These return a value to the driver program after running a computation on the dataset.

In RDD the transformations are lazy. Transformations do not compute their results right away; instead, they just remember the transformations which were applied to the base dataset, and the results are only computed when an action requires them.
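
The following sketch illustrates this laziness; it again assumes the context from Listing 2 and a hypothetical input file, and nothing is read or computed until the reduce action runs:

JavaRDD<String> lines = context.textFile("data/input.txt");  // creates an RDD lazily; the file path is hypothetical
JavaRDD<Integer> lengths = lines.map(String::length);        // transformation: only recorded, not computed
int totalChars = lengths.reduce((a, b) -> a + b);            // action: triggers the actual computation
System.out.println("Total characters: " + totalChars);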

Summary: In the above discussion I have explained different aspects of the Apache Spark framework and its implementation. The performance of Spark compared to a normal MapReduce job is also one of the most important aspects we should understand clearly.

Let us conclude our discussion in the following bullets:

  • Spark is a framework from Apache that provides a high-performance engine used to process large-scale data.
  • It is developed on top of HDFS, but it does not use the two-stage map reduce paradigm.
  • Spark promises up to 100 times faster performance than Hadoop for certain applications.
  • Spark performs best on clusters.
  • Spark can scale up to a range of 80 to 100 nodes.
  • Spark has the ability to cache datasets in memory.
  • Spark can be configured using a properties file and some environment variables.
  • Spark is based on resilient distributed datasets (RDDs), which are fault-tolerant collections of objects.