What is Apache Spark?

Overview: Apache Spark is a high-performance, general-purpose engine used to process large-scale data. It is an open source framework for cluster computing. The aim of this framework is to make data analytics faster, both in terms of development and execution. In this document, I will talk about Apache Spark and discuss various aspects of this framework.

Introduction: Apache Spark is an open source framework for cluster computing. It is built on top of the Hadoop Distributed File System (HDFS), but it does not use the two-stage MapReduce paradigm, and it promises up to 100 times faster performance for certain applications. Spark also provides the building blocks for in-memory cluster computing: application programs can load data into the memory of a cluster and query it repeatedly. This in-memory computation makes Spark one of the most important components in the big data world.

Features: Now let us discuss the features in brief. Apache Spark offers the following features:

  • APIs based on Java, Scala and Python.
  • Scalability to clusters ranging from 80 to 100 nodes.
  • Ability to cache a dataset in memory for interactive use, e.g. extract a working set, cache it and query it repeatedly.
  • Efficient library for stream processing.
  • Efficient libraries for machine learning and graph processing.

When talking about Spark in the context of data science, what stands out is its ability to keep the resident data in memory. This approach improves performance compared to MapReduce. Viewed from the top, a Spark application consists of a driver program which runs the client's main method and executes various operations in parallel on a clustered environment.

Spark provides the resilient distributed dataset (RDD), which is a collection of elements distributed across the different nodes of the cluster so that they can be operated on in parallel. Spark has the ability to store an RDD in memory, allowing it to be reused efficiently across parallel executions. RDDs can also recover automatically in case of node failure.

Spark also provides shared variables that are used in parallel operations. When Spark runs a set of tasks in parallel on different nodes, it ships a copy of each variable to each task. These variables can also be shared between different tasks. In Spark we have two types of shared variables –

  • broadcast variables – used to store a value in memory on the nodes.
  • accumulators – used for counters and sums (a short Java sketch of both follows this list).
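The sketch below illustrates both kinds of shared variables using the Java API. It is a minimal example, assuming a local master, Java 8 lambda syntax and the Spark 1.x JavaSparkContext API; the class name and numeric values are illustrative only.

import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SharedVariablesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Shared Variables Demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Broadcast variable: a read-only value shipped once to every node.
        Broadcast<Integer> factor = sc.broadcast(3);

        // Accumulator: tasks may only add to it; the driver reads the total.
        Accumulator<Integer> processed = sc.accumulator(0);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        JavaRDD<Integer> scaled = numbers.map(n -> {
            processed.add(1);             // count every element we touch
            return n * factor.value();    // read the broadcast value
        });

        System.out.println(scaled.collect());    // [3, 6, 9, 12, 15]
        System.out.println(processed.value());   // 5
        sc.stop();
    }
}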

Configuring Spark:

Spark offers three main areas for configuration:

  • Spark properties – These control most of the application settings and can be set either using the SparkConf object or with the help of Java system properties.
  • Environment variables – These can be used to configure machine-based settings, e.g. the IP address, with the help of the conf/spark-env.sh script on every node.
  • Logging – This can be configured using the standard log4j properties.

Spark Properties: Spark properties control most of the application settings and should be configured separately for separate applications. These properties can be set using a SparkConf object, which is passed to the SparkContext. SparkConf allows us to configure most of the common properties needed to initialize an application. Using the set() method of the SparkConf class we can also set arbitrary key-value pairs. A sample using the set() method is shown below –

Listing 1: Sample showing the set() method

val conf = new SparkConf()
    .setMaster("aws")
    .setAppName("My Sample SPARK Application")
    .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

Some of the common properties are –
• spark.executor.memory – Indicates the amount of memory to be used per executor.
• spark.serializer – Class used to serialize objects that will be sent over the network. Since the default Java serialization is quite slow, it is recommended to use the org.apache.spark.serializer.KryoSerializer class to get better performance (a short configuration sketch follows this list).
• spark.kryo.registrator – Class used to register custom classes when we use Kryo serialization.
• spark.local.dir – Locations that Spark uses as scratch space to store the map output files.
• spark.cores.max – Used in standalone mode to specify the maximum number of CPU cores to request.
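As a minimal illustration of these properties, the Java fragment below extends the SparkConf pattern from Listing 1; the memory size and core count are illustrative values, not recommendations.

SparkConf conf = new SparkConf()
        .setAppName("Property Demo")
        .set("spark.executor.memory", "2g")                                     // memory per executor
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // faster serialization
        .set("spark.cores.max", "4");                                           // cap on cores in standalone mode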

Environment Variables: Some of the Spark settings can be configured using environment variables, which are defined in the conf/spark-env.sh script file. These are machine-specific settings, e.g. the library search path, the Java path, etc. Some of the commonly used environment variables are listed below, followed by a minimal example script –

  • JAVA_HOME – Location where Java is installed on your system.
  • PYSPARK_PYTHON – The Python executable to use for PySpark.
  • SPARK_LOCAL_IP – IP address of the machine to bind to.
  • SPARK_CLASSPATH – Used to add libraries to the runtime classpath.
  • SPARK_JAVA_OPTS – Used to add JVM options.
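A minimal conf/spark-env.sh sketch using the variables above might look like the following; every path and address shown here is an illustrative assumption and must be adapted to the actual machine.

# conf/spark-env.sh (illustrative values only)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export PYSPARK_PYTHON=/usr/bin/python3
export SPARK_LOCAL_IP=192.168.1.15
export SPARK_CLASSPATH=/opt/libs/my-custom-lib.jar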

Logging: Spark uses the standard log4j API for logging, which can be configured using the log4j.properties file.
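For reference, a minimal log4j.properties sketch in the standard log4j 1.x format could look like this; the log level and output pattern are illustrative choices.

# conf/log4j.properties (illustrative)
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n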

Initializing Spark:

To start with a Spark program, the first thing is to create a JavaSparkContext object, which tells Spark how to access the cluster. To create a Spark context we first create a SparkConf object as shown below:

Listing 2: Initializing the Spark context object

SparkConf config = new SparkConf().setAppName(applicationName).setMaster(master);
JavaSparkContext context = new JavaSparkContext(config);

The applicationName parameter is the name of our application, which is shown in the cluster UI. The master parameter is the cluster URL, or a special local string used to run in local mode.

Resilient Distributed Datasets (RDDs):

Spark is based on the concept of the resilient distributed dataset, or RDD. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. An RDD can be created in one of the following two ways, both shown in the sketch after this list:

  • By parallelizing an existing collection – Parallelized collections are created by calling the parallelize method of the JavaSparkContext class in the driver program. Elements of the collection are copied from an existing collection and can then be operated on in parallel.
  • By referencing a dataset on an external storage system – Spark has the ability to create distributed datasets from any Hadoop-supported storage, e.g. HDFS, Cassandra, HBase, etc.
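The following minimal Java sketch shows both ways of creating an RDD; a local master is assumed, and the file path is an illustrative placeholder.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDD Creation Demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. Parallelize an existing collection held in the driver program.
        JavaRDD<Integer> fromCollection = sc.parallelize(Arrays.asList(10, 20, 30, 40));

        // 2. Reference a dataset on external storage (the path must exist; an hdfs:// URL works the same way).
        JavaRDD<String> fromFile = sc.textFile("data/input.txt");

        System.out.println(fromCollection.count());  // 4
        sc.stop();
    }
}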

RDD Operations:

RDD supports two types of operations –

  • Transformations – Used to create new datasets from an existing one.
  • Actions – These return a value to the driver program after executing the code on the dataset.

In RDDs the transformations are lazy: they do not compute their results right away. Instead they just remember the transformations that were applied to the base dataset, and the actual work is performed only when an action needs a result.
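The Java sketch below makes this distinction concrete: the filter and map calls only describe new datasets, and no computation happens until the count and collect actions are invoked. The class name and sample data are illustrative.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOperationsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDD Operations Demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.parallelize(
                Arrays.asList("spark is fast", "hadoop uses mapreduce", "spark caches data"));

        // Transformations: lazily describe new datasets; nothing runs yet.
        JavaRDD<String> sparkLines = lines.filter(line -> line.contains("spark"));
        JavaRDD<Integer> lengths = sparkLines.map(String::length);

        // Actions: trigger the actual computation and return values to the driver.
        System.out.println("Lines mentioning spark: " + sparkLines.count());  // 2
        System.out.println("Their lengths: " + lengths.collect());            // [13, 17]

        sc.stop();
    }
}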

Summary: In the discussion above I have explained different aspects of the Apache Spark framework and its implementation. The performance of Spark compared to a normal MapReduce job is also one of the most important aspects to understand clearly.

Let us conclude our discussion in the following bullets:

  • Spark is a framework from Apache which delivers a high-performance engine used to process large-scale data.
  • It is developed on top of HDFS, but it does not use the MapReduce paradigm.
  • Spark promises up to 100 times faster performance than Hadoop for certain applications.
  • Spark performs best on clusters.
  • Spark can scale up to a range of 80 to 100 nodes.
  • Spark has the ability to cache datasets in memory.
  • Spark can be configured with the help of a properties file and some environment variables.
  • Spark is based on the resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be processed in parallel.