What is Apache Spark?

Overview: Apache Spark is a high-performance, general-purpose engine for processing large-scale data. It is an open-source framework for cluster computing. The aim of this framework is to make data analytics faster, both in terms of development and execution. In this document, I will talk about Apache Spark and discuss the various aspects of this framework.

Introduction: Apache Spark is an open-source framework for cluster computing. It is built on top of the Hadoop Distributed File System (HDFS). It does not use the two-stage map reduce paradigm, yet it promises up to 100 times faster performance for certain applications. Spark also provides the primitives for in-memory cluster computing, which lets an application load data into the memory of a cluster so that it can be queried repeatedly. This in-memory computation makes Spark one of the most important components in the big data computation world.

Features: Now let us discuss the features in brief. Apache Spark comes with the following features:

  • APIs based on Java, Scala, and Python.
  • Scalability to between 80 and 100 nodes.
  • The ability to cache datasets in memory for interactive data analysis, e.g. extract a working set, cache it, and query it repeatedly.
  • An efficient library for stream processing.
  • Efficient libraries for machine learning and graph processing.

When we talk about Spark in the context of data science, it is worth noting that Spark can keep data resident in memory. This approach improves performance compared with map reduce. Viewed from the top, Spark consists of a driver program that runs the client's main method and executes various operations in parallel on a clustered environment.

Spark provides the resilient distributed dataset (RDD), a collection of elements distributed across the different nodes of a cluster so that they can be processed in parallel. Spark has the ability to store an RDD in memory, which allows it to be reused efficiently across parallel operations. RDDs can also recover automatically in case of node failure.
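
To make this in-memory reuse concrete, here is a minimal Java sketch, assuming a local Spark installation; the class name, log file path, and filter conditions are purely illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddCacheExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a text file as an RDD and keep only the lines of interest.
        JavaRDD<String> errors = sc.textFile("/tmp/app.log")                 // illustrative path
                                   .filter(line -> line.contains("ERROR"));

        // cache() keeps the working set in memory so repeated queries can reuse it.
        errors.cache();

        long total = errors.count();                                         // first action materializes and caches the RDD
        long timeouts = errors.filter(l -> l.contains("timeout")).count();   // answered from the cached data

        System.out.println("errors=" + total + ", timeouts=" + timeouts);
        sc.close();
    }
}

The first action (count) materializes the RDD and caches it; the second query is then answered from memory instead of re-reading the file.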

Spark also provides shared variables which are used in parallel operations. When Spark runs a set of tasks in parallel on different nodes, it transfers a copy of each variable to every task. These variables sometimes also need to be shared across different tasks. In Spark we have two types of shared variables (see the sketch after this list) –

  • Broadcast variables – used to cache a value in memory on all nodes.
  • Accumulators – used for counters and sums.
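
To make the two kinds of shared variables concrete, here is a minimal Java sketch; the class name, threshold, and sample data are illustrative, and the accumulator call assumes the Spark 2.x+ Java API:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVariablesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SharedVariablesExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Broadcast variable: a read-only value cached once per node instead of shipped with every task.
        Broadcast<Integer> threshold = sc.broadcast(10);

        // Accumulator: tasks may only add to it; the driver reads the final value (counters, sums).
        LongAccumulator aboveThreshold = sc.sc().longAccumulator("aboveThreshold");

        sc.parallelize(Arrays.asList(3, 12, 25, 7, 40))
          .foreach(n -> {
              if (n > threshold.value()) {
                  aboveThreshold.add(1);
              }
          });

        System.out.println("Values above threshold: " + aboveThreshold.value());
        sc.close();
    }
}

The tasks only read the broadcast value and add to the accumulator; the accumulator's final value is read back on the driver.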

Configuring Spark:

Spark provides three main areas for configuration:

  • Spark properties – These control most of the application settings and can be set either through the SparkConf object or with the help of Java system properties.
  • Environment variables – These can be used to configure machine-based settings, e.g. the IP address, with the help of the conf/spark-env.sh script on every single node.
  • Logging – This can be configured using the standard log4j properties.

Spark Properties: Spark properties control most of the application settings and should be configured separately for each application. These properties can be set using a SparkConf object, which is passed to the SparkContext. SparkConf allows us to configure most of the common properties needed for initialization. Using the set() method of the SparkConf class we can also set arbitrary key-value pairs. A sample using the set() method is shown below –

Listing 1: Sample showing the set() method

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("My Sample SPARK Application")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

Some of the common properties are listed below (a short configuration sketch follows the list) –
  • spark.executor.memory – Indicates the amount of memory to be used per executor.
  • spark.serializer – The class used to serialize objects that are sent over the network. Since the default Java serialization is quite slow, it is recommended to use the org.apache.spark.serializer.KryoSerializer class for better performance.
  • spark.kryo.registrator – The class used to register custom classes if we use Kryo serialization.
  • spark.local.dir – Locations that Spark uses as scratch space for storing map output files.
  • spark.cores.max – Used in standalone mode to specify the maximum number of CPU cores to request.
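
As a rough illustration of how these properties fit together (the values shown are placeholders, not tuning recommendations), they can be supplied through the same SparkConf object discussed above:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CommonPropertiesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("CommonPropertiesExample")
            .setMaster("local[*]")
            .set("spark.executor.memory", "2g")                                     // memory per executor
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster than Java serialization
            .set("spark.local.dir", "/tmp/spark-scratch")                          // scratch space for map output files
            .set("spark.cores.max", "4");                                          // standalone mode: maximum cores to request

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs and run jobs with sc ...
        sc.close();
    }
}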

Environment Variables: Some of the Spark settings can be configured through environment variables, which are defined in the conf/spark-env.sh script file. These are machine-specific settings, e.g. the library search path, the Java path, etc. Some of the commonly used environment variables are –

  • JAVA_HOME – The location where Java is installed on your system.
  • PYSPARK_PYTHON – The Python executable to use for PySpark.
  • SPARK_LOCAL_IP – The IP address of the machine to bind to.
  • SPARK_CLASSPATH – Used to add libraries that are needed on the classpath at runtime.
  • SPARK_JAVA_OPTS – Used to add JVM options.

Logging: Spark uses the standard log4j API for logging, which can be configured using the log4j.properties file.

Initializing Spark:

To start with a Spark program, the first thing to do is create a JavaSparkContext object, which tells Spark how to access the cluster. To create a Spark context we first create a SparkConf object, as shown below:

Listing 2: Initializing the spark context object

SparkConf config = new SparkConf().setAppName(applicationName).setMaster(master);

JavaSparkContext context = new JavaSparkContext(config);

The parameter applicationName is the name of our application, which is shown on the cluster UI. The parameter master is the cluster URL, or a special "local" string used to run in local mode.

Resilient Distributed Datasets (RDDs):

Spark is based on the concept of the resilient distributed dataset, or RDD. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. An RDD can be created in either of the following two ways:

  • By parallelizing an existing collection – A parallelized collection is created by calling the parallelize method of the JavaSparkContext class in the driver program. The elements of the collection are copied from the existing collection and can then be operated on in parallel.
  • By referencing a dataset in an external storage system – Spark can create distributed datasets from any Hadoop-supported storage, e.g. HDFS, Cassandra, HBase, etc. Both approaches are shown in the sketch below.
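
A minimal Java sketch of both creation paths follows; the class name, sample numbers, and HDFS path are illustrative:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddCreationExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. Parallelize an existing collection held in the driver program.
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> parallelized = sc.parallelize(numbers);

        // 2. Reference a dataset in an external, Hadoop-supported storage system (path is illustrative).
        JavaRDD<String> fromStorage = sc.textFile("hdfs://namenode:9000/data/input.txt");
        // fromStorage can now be transformed and queried like any other RDD.

        System.out.println("Collection elements: " + parallelized.count());
        sc.close();
    }
}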

RDD Operations:

An RDD supports two types of operations –

  • Transformations – Used to create new datasets from an existing one.
  • Actions – These return a value to the driver program after running a computation on the RDD.

In an RDD, transformations are lazy: they do not compute their results immediately. Instead, Spark only remembers the transformations applied to the underlying dataset and computes them when an action needs a result.
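
The following Java sketch illustrates this behaviour with illustrative sample data: the map and filter transformations only record the lineage, and nothing is computed until the count action is invoked.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyTransformationsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyTransformationsExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // Transformations: build new RDDs from existing ones; nothing is executed yet.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
        JavaRDD<Integer> large = doubled.filter(n -> n > 6);

        // Action: triggers execution of the recorded transformations and returns a value to the driver.
        long howMany = large.count();
        System.out.println("Values greater than 6 after doubling: " + howMany);

        sc.close();
    }
}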

Summary: In the discussion above I have explained various aspects of the Apache Spark framework and how it is used. The performance of Spark compared with a normal MapReduce job is also one of the most important aspects we should clearly understand.

Let us conclude our discussion with the following bullet points:

  • Spark is a framework introduced by Apache that provides a high-performance engine for processing large-scale data.
  • It is developed on top of HDFS, but it does not use the map reduce paradigm.
  • Spark promises up to 100 times faster performance than Hadoop.
  • Spark performs best on clusters.
  • Spark can scale up to around 80 to 100 nodes.
  • Spark has the ability to cache datasets in memory.
  • Spark can be configured with the help of a properties file and a few environment variables.
  • Spark is based on resilient distributed datasets (RDDs), which are fault-tolerant collections of elements.