Introduction to Apache Spark with Examples and Use Cases

Apache Spark


By Radek Ostrowski – Software Engineer @ Toptal

I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.








Today, Spark has been adopted by major players like Amazon, eBay, and Yahoo! Many organizations run Spark on clusters with thousands of nodes. According to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about.

Image 1: Apache Spark

This article provides an introduction to Spark including use cases and examples. It contains information from the Apache Spark website as well as the book Learning Spark – Lightning-Fast Big Data Analysis.

What is Apache Spark? An Introduction

Spark is an Apache project advertised as "lightning fast cluster computing". It has a thriving open-source community and is the most active Apache project at the moment.

Spark provides a faster and more general data processing platform. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Last year, Spark took over Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte.

Spark also makes it possible to write code more quickly as you have over 80 high-level operators at your disposal. To demonstrate this, let’s have a look at the “Hello World!” of BigData: the Word Count example. Written in Java for MapReduce, it has around 50 lines of code, whereas in Spark (and Scala) you can do it as simply as this:

sparkContext.textFile("hdfs://…")
            .flatMap(line => line.split(" "))
            .map(word => (word, 1)).reduceByKey(_ + _)
            .saveAsTextFile("hdfs://…")
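To make each step concrete, here is the same word-count logic sketched in plain Python, without Spark. This is only a local analogy of what `flatMap`, `map`, and `reduceByKey` do; the input lines are made-up sample data, not an HDFS file.

```python
from collections import defaultdict

# A local stand-in for the RDD of lines (in Spark these would come from HDFS).
lines = ["hello world", "hello spark"]

# flatMap: split each line into words and flatten into a single sequence
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey(_ + _): sum the counts for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # → {'hello': 2, 'world': 1, 'spark': 1}
```

The difference, of course, is that Spark runs each of these stages in parallel across a cluster, while this sketch runs on a single list in memory.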

Another important aspect when learning how to use Apache Spark is the interactive shell (REPL) which it provides out-of-the box. Using REPL, one can test the outcome of each line of code without first needing to code and execute the entire job. The path to working code is thus much shorter and ad-hoc data analysis is made possible.

Additional key features of Spark include:

  • Currently provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way
  • Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
  • Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone

The Spark core is complemented by a set of powerful, higher-level libraries which can be seamlessly used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is further detailed in this article. Additional Spark libraries and extensions are currently under development as well.

Spark libraries

Image 2: Apache Spark libraries

Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:

  • memory management and fault recovery
  • scheduling, distributing and monitoring jobs on a cluster
  • interacting with storage systems

Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program.

RDDs support two types of operations:

  • Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result.
  • Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.

Transformations in Spark are “lazy”, meaning that they do not compute their results right away. Instead, they just “remember” the operation to be performed and the dataset (e.g., file) to which the operation is to be applied. The transformations are only actually computed when an action is called and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file was transformed in various ways and passed to the first action, Spark would only process and return the result for the first line, rather than do the work for the entire file.
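Python generators give a rough standalone analogy for this laziness. This is only an illustration of the idea, not Spark's actual mechanism; the `transformed` function and the `log` list are invented for the example.

```python
log = []

def transformed(lines):
    # Like a transformation: nothing below runs until a value is requested.
    for line in lines:
        log.append(line)          # record which lines were actually processed
        yield line.upper()

data = transformed(["a", "b", "c"])
assert log == []                  # nothing has been computed yet ("lazy")

first = next(data)                # like calling the `first` action
assert first == "A"
assert log == ["a"]               # only the first element was ever processed
```

Just as here only `"a"` was touched when the first result was requested, Spark avoids materializing the whole transformed dataset when an action only needs part of it.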

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist or cache method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
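The effect of persisting can be illustrated with a plain-Python sketch of recomputation versus a materialized result. This is an analogy only; `expensive_transform` is a hypothetical stand-in for a transformation, not a Spark API.

```python
compute_calls = 0

def expensive_transform(x):
    """A stand-in for a costly transformation; counts how often it runs."""
    global compute_calls
    compute_calls += 1
    return x * x

data = range(3)

# Without persisting: every "action" recomputes the transformation.
list(map(expensive_transform, data))
list(map(expensive_transform, data))
assert compute_calls == 6          # 3 elements recomputed per action

# With something like rdd.cache(): compute once, then reuse the results.
cached = [expensive_transform(x) for x in data]   # materialize once
assert cached == [0, 1, 4]
list(cached)                       # further "actions" reuse the cached values
list(cached)
assert compute_calls == 9          # only 3 additional computations, ever
```

In Spark the cached elements live distributed across the cluster's memory (or disk, depending on the storage level) rather than in a local list, but the computational saving is the same.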








SparkSQL

SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive compatible query:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
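The appeal of this pattern is that query results come back as ordinary collections that your code can keep transforming. As a rough standalone analogy (using Python's built-in sqlite3 rather than SparkSQL, with made-up sample rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, "a"), (2, "b")])

# Query results mix freely with ordinary code transformations,
# much like chaining .collect() with Scala collection operations.
rows = conn.execute("SELECT key, value FROM src").fetchall()
doubled = [(key * 2, value.upper()) for key, value in rows]
assert doubled == [(2, "A"), (4, "B")]
```

SparkSQL does the same weaving of SQL and code, except the query is planned and executed in parallel across the cluster instead of in a local database file.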

Spark Streaming

Spark Streaming supports real time processing of streaming data, such as production web server log files (e.g. Apache Flume and HDFS/S3), social media like Twitter, and various messaging queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches. Next, they get processed by the Spark engine and generate the final stream of results in batches, as depicted below.
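The micro-batching idea itself is simple. A minimal plain-Python sketch follows; note that Spark Streaming actually groups records by a time interval, whereas this illustration groups by a hypothetical batch size to keep it self-contained.

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch        # hand a complete batch to the processing engine
            batch = []
    if batch:
        yield batch            # flush the final partial batch

incoming = ["e1", "e2", "e3", "e4", "e5"]
batches = list(micro_batches(incoming, batch_size=2))
assert batches == [["e1", "e2"], ["e3", "e4"], ["e5"]]
```

Each yielded batch corresponds to one small Spark job, which is how Spark Streaming reuses the batch engine for streaming workloads.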

Spark streaming

Image 3: Spark Streaming and the Spark engine

The Spark Streaming API closely matches that of the Spark Core, making it easy for programmers to work in the worlds of both batch and streaming data.








MLlib

MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on. Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (and more on the way). Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces on Spark MLlib.
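To give a flavor of what an algorithm like k-means does, here is a toy one-dimensional version in plain Python. This is a conceptual sketch with made-up data, not MLlib's implementation, which distributes both point assignment and center updates across the cluster.

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (drop empty ones).
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

# Two obvious clumps of points, near 1.0 and near 10.0.
points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans_1d(points, centers=[0.0, 5.0]))   # converges to roughly [1.0, 10.0]
```

The per-point "find nearest center" step is exactly the kind of embarrassingly parallel work that MLlib maps over RDD partitions.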

 
