Taw qhia rau Apache txim cov lus piv ntxwv thiab kev siv

Apache Spark

Apache txim

LOS NTAWM RADEK OSTROWSKI – KWS UA CHOJ @ TOPTAL SOFTWARE

Kuv xub hnov ntawm lub txim nyob lig 2013 thaum kuv pib xav Scala, cov lus uas txim tim. Tej tsam lwm hnub, Puas muaj kev lom zem kuv cov ntaub ntawv kev kawm ib qhov rau toov ciaj sia taus rau tus Titanic. Qhov no muab kom tau ib txoj kev zoo mus ntxiv ua tswvcuab txim tswvyim thiab programming. Kuv mas kom nws rau tej aspiring txim developers nrhiav ib qho chaw kom pib.








Hnub no, Txim yog tau txais yuav los ntawm loj players li Amazon, eBay, thiab Yahoo! Cov koom haum khiav txim rau tej pawg ua ke nrog ntshav txhiab. Raws li lub txim FAQ, paub tias cov pawg coob muaj dua 8000 o. Tseeb, Txim yog ib lub tshuab zoo tsim nyog kev ceeb toom txog thiab kev kawm txog.

Image1: Apache txim

Qhov tsab xov xwm no qhia cov kev taw qhia rau txim xws li tus neeg mob siv thiab piv txwv. Nws muaj cov lus qhia ntawm lub txim Apache website, thiab cov phau ntawv Txim kawm – Xob laim-vas nthiv loj Data Analysis.

Apache txim yog dab tsi? Qhov taw qhia

Txim yog ib qhov project Apache advertised tias "xob yoo sawv xam". Nws muaj ib thriving qhib qhov tuab thiab yog ib qhov zoo tshaj cum Apache thaum lub caij.

Txim muab ib lub platform ceev thiab dav dua cov ntaub ntawv ua. Txim cia koj khiav cov kev pab cuam mus txog 100 x sai nco qas ntsoov, los yog 10 x sai rau disk, tshaj li Hadoop. Tsaib no, Txim coj lawm Hadoop yog koj ua tiav cov 100 TB Daytona GraySort contest 3 x sai rau caum ib cov cav tov thiab nws kuj ua rau ceev tshaj cav qhib tau qhov twg los kev sorting ib petabyte.

Txim kuj yuav ua rau nws tau sau ntawv cai sai raws li koj muaj dua 80 has tswv ntawm koj pov tseg. Nqai no, wb sib "nyob zoo ntiaj teb saib!"ntawm BigData: cov suav lo lus piv txwv. Sau Java rau MapReduce muaj ib ncig 50 ntawm txoj kab, whereas rau txim (thiab Scala) koj muaj peev xwm ua kom cias li no:

sparkContext.textFile(“hdfs://…”)

.flatMap(kab => line.split(” “))

.daim ntawv qhia(lo lus => (lo lus, 1)).reduceByKey(_ + _)

.saveAsTextFile(“hdfs://…”)

Lwm ceeb nam thaum siv Apache txim tej hau kev kawm yog lub plhaub sib tham sib (REPL) uas nws siv tawm-txog-lub thawv. Siv REPL, ib sim yuav hlav chaws txhua lub sij hawm tsis muaj thawj hu ua deductible code thiab coj cov hauj lwm tag nrho. Yog li luv npaum li cas mus ua hauj lwm txoj kev yog thiab ad-hoc tsom xam cov ntaub ntawv no ua tau.

Ntxiv tseem ceeb nta cov txim muaj xws li:

  • Tam sim no uas muaj APIs hauv Scala, Java, thiab nab hab sej, nrog kev them nyiaj yug rau lwm haiv lus (xws li R) tus
  • Integrates nrog cov Hadoop ecosystem thiab cov kev pab zoo (HDFS, Amazon S3, Nas muv, HBase, Cassandra, thiab lwm yam.)
  • Yuav khiav tau nyob tej pawg ua ke tswj los ntawm cov xov paj Hadoop los yog Apache Mesos, thiab yuav tau khiav standalone

Cov tub ntxhais txim yog complemented los txheej haib, higher-level qiv uas tau seamlessly muab sau rau hauv daim ntawv thov tib. Tam sim no qiv muaj SparkSQL, Txim Streaming, MLlib (rau cov kev kawm tshuab), thiab GraphX, txhua yam uas yog ntxiv xav paub ntxiv hauv no Tshooj. Ntxiv cov txim qiv thiab extensions uas tam sim no nyob rau hauv txoj kev loj hlob zoo li.

Spark libraries

Txim qiv

Image2: Apache txim qiv

Tub ntxhais txim

Tub ntxhais txim yog lub cav luj large-scale mus tib seem thiab distributed cov ntaub ntawv ua. Nws yog lub luag hauj lwm rau:

  • nco kev tswj thiab kev txhaum thiaj rov qab
  • teem dua, distributing thiab xyuas hauj lwm rau sawv
  • interacting uas cia lub nruab nrog

Txim introduces lub tswvyim ntawm tus RDD (Resilient faib Dataset), tus immutable txhaum-tiv thaiv, distributed sau los ntawm cov khoom uas yuav tau ua nyob rau hauv mus tib seem. Tus RDD yuav muaj tej yam twj paj nruas thiab yog tsim los loading ib dataset lwm distributing ib phau ntawm cov neeg tsav tsheb.

RDDs yug ob hom haujlwm:

  • Transformationscov haujlwm (xws li daim ntawv qhia, lim, sib sau, pab neeg, thiab hais txog) uas tau muaj kev RDD thiab ib tug tshiab RDD uas muaj cov paib uas.
  • Tej yam ua taucov haujlwm (xws li kom tsis txhob, suav, thawj, thiab hais txog) uas rov qab muaj nqis tom qab khiav ib le caag nyob ib RDD.

Transformations kev rau txim yog "tubnkeeg", lub ntsiab lus uas lawv tsis laij cov ntsiab tam. Xwb, lawv nyuam qhuav "nco qab" yuav tau muab lub lag luam thiab cov dataset (xws li, cov ntaub ntawv) rau lub lag luam no yog tau. Tus transformations no xwb yeej xoo thaum luag hu ua thiab cov rov qab mus rau qhov kev pab cuam tsav. No tsim enables txim ntau nraaj khiav. Piv txwv, yog ib daim ntawv loj loj heev transformed muaj ntau txoj kev uas ua tau mus thawj ua, Txim yuav tsuas ntaub ntawv thiab xa rov qab mus qhov tshwm sim rau ntawm thawj kab, es tsis ua haujlwm rau tus ua ntaub ntawv thov tag nrho.

Yog vim, txhua transformed RDD yuav tsum recomputed txhua lub sij hawm koj khiav ib qho rau nws. Txawm li cas los, tej zaum koj kuj mob ib RDD nco qas ntsoov siv txoj kev persist los yog cache, nyob rau hauv uas txim ntawd tseem yuav tseg lub ntsiab nyob rau cov pawg npaum sai nkag rau lwm zaus koj query.








SparkSQL

SparkSQL yog tus txim tivthaiv ua txhawb tau lub querying tau ntaub ntawv xws li ntawm SQL los yog ntawm tus Nas muv lus nug lus. Nws originated ua tus Apache Hive chaw nres nkoj khiav saum txim (sawv cev ntawm MapReduce) thiab tam sim no tau kev nrog lub txim pawg. Ntxiv rau kev pab txhawb kom muaj ntau yam kev pab cov ntaub ntawv, nws ua rau nws tau hiab SQL queries nrog transformations chaws uas ua rau ib tug uas haib heev. Hauv qab no yog ib qho piv txwv ntawm tus nas muv tau tshaj lus nug:

// sc yog ib SparkContext uas twb muaj lawm.

val sqlContext = org.apache.spark.sql.hive.HiveContext tshiab(sc)

sqlContext.sql(“SAU cov lus tsis yog tshwm sim src (qhov tseem ceeb rau cov menyuam, tus nqi hlua)”)

sqlContext.sql(“LOAD cov ntaub ntawv hauv zos INPATH ' examples/src/main/resources/kv1.txt’ Rau hauv lub rooj src”)

// Queries muaj expressed hauv HiveQL

sqlContext.sql(“NTAWM src xaiv qhov tseem ceeb, tus nqi”).sau().foreach(println)

Txim Streaming

Txim Streaming txhawb kev ua ntawm lub sij hawm uas tau streaming ntaub ntawv, xws li ntau lawm web neeg rau zaub mov cav ntaub ntawv (e.g. Apache Flume thiab HDFS/S3), kev tawm li Twitter, thiab tam ntau queues nyiam Kafka. Nyob rau hauv cov hood, Txim tau Streaming tau txais cov ntaub ntawv input ntws thiab divides cov ntaub ntawv rau hauv cooling. Tom ntej, lawv tau txais los ntawm lub txim cav ua thiab kwj kawg uas tau tsim nyob rau hauv cooling, raws li depicted qab no.

Spark streaming

Txim streaming

Duab 3: Txim streaming thiab cav

Lub txim Streaming API siv qhov khoom uas tus tub ntxhais txim, ua rau programmers ua hauj lwm hauv lub worlds ob batch thiab streaming tej ntaub ntawv yooj yim.








MLlib

MLlib yog ib tshuab kev kawm ntawv uas muaj ntau yam algorithms tsim los teev tseg rau ib pawg rau kev faib tawm, regression, clustering, thiab cob qha filtering, thiab hais txog. Tej cov algorithms tseem ua hauj lwm uas tau streaming ntaub ntawv, xws li tawm siv zoo tib yam tsawg tshaj plaws squares lossis k-txhais tau tias clustering regression (thiab ntxiv rau qhov uas). Apache Mahout (lub tsev qiv ntawv kev kawm tshuab rau Hadoop) raug twb xees nres tam sim ntawd ntawm MapReduce thiab koom rau txim MLlib.

 

============================================= ============================================== Yuav zoo TechAlpine phau ntawv rau Amazon
============================================== ---------------------------------------------------------------- electrician ct chestnutelectric
error

Txaus siab rau qhov blog? Tshaj tawm lus thov :)

Follow by Email
LinkedIn
LinkedIn
Share