What are the Hadoop MapReduce concepts?

What do you mean by Map-Reduce programming?

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.

The MapReduce programming model is inspired by functional languages and targets data-intensive computations. The input data format is application-specific and is specified by the user. The output is a set of <key,value> pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied to the input data and produces a list of intermediate <key,value> pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs that share the same key and the final sorting, is provided by the runtime.
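
To make the model concrete, here is a minimal, self-contained Java simulation of word count in a single JVM. The class and method names are purely illustrative and are not part of any Hadoop API; the TreeMap stands in for the grouping and sorting that a real runtime performs between the two functions.

import java.util.*;

// A single-JVM simulation of the MapReduce model for word count.
// Illustrative only; a real framework runs map and reduce tasks in parallel.
public class MapReduceModelDemo {

    // Map: one input record -> a list of intermediate <key, value> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce: all intermediate values for one key -> a merged output value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("horses eat oats", "cows eat grass");

        // Stand-in for the runtime: group intermediate values by key, sorted.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}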

Phases of the MapReduce model

The top-level unit of work in MapReduce is a job. A job usually has a map phase and a reduce phase, though the reduce phase can be omitted. For example, consider a MapReduce job that counts the number of times each word is used across a set of documents. The map phase counts the words in each document, then the reduce phase aggregates the per-document data into word counts spanning the entire collection.

During the map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster. By default, the MapReduce framework gets its input data from the Hadoop Distributed File System (HDFS).

The reduce phase uses the results from the map tasks as input to a set of parallel reduce tasks. The reduce tasks consolidate the data into final results. By default, the MapReduce framework stores results in HDFS.

Although the reduce phase depends on output from the map phase, map and reduce processing is not necessarily sequential. That is, reduce tasks can begin copying map output (the shuffle) as soon as any map task completes. It is not necessary for all map tasks to complete before any reduce task can begin.

MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through the map and reduce functions. The map tasks produce an intermediate set of key-value pairs that the reduce tasks use as input.

The keys in the map output pairs need not be unique. Between the map processing and the reduce processing, a shuffle step sorts all map output values with the same key into a single reduce input (key, value-list) pair, where the ‘value’ is a list of all values sharing the same key. Thus, the input to a reduce task is actually a set of (key, value-list) pairs.

Though each set of key-value pairs is homogeneous, the key-value pairs in each step need not have the same type. For example, the key-value pairs in the input set (KV1) can be (string, string) pairs, with the map phase producing (string, integer) pairs as intermediate results (KV2), and the reduce phase producing (integer, string) pairs for the final results (KV3).
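
In Hadoop's Java API, this type progression appears directly in the generic parameters of the Mapper and Reducer base classes. Below is a sketch for the word-count case, where KV1 is (line offset, line text) and both KV2 and KV3 are (word, count); the class names here are hypothetical.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// KV1 -> KV2: consumes (LongWritable, Text), emits (Text, IntWritable).
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// KV2 -> KV3: consumes (Text, IntWritable), emits (Text, IntWritable).
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }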

Example demonstrating MapReduce concepts

The example demonstrates basic MapReduce concepts by calculating the number of occurrences of each word in a set of text files.

The MapReduce input data is divided into input splits, and the splits are further divided into input key-value pairs. In this example, the input data set is two documents, document1 and document2. The InputFormat subclass divides the data set into one split per document, for a total of two splits.

Note: The MapReduce framework divides the input data set into chunks called splits using the org.apache.hadoop.mapreduce.InputFormat subclass supplied in the job configuration. Splits are created by the local Job Client and included in the job information made available to the Job Tracker. The Job Tracker creates a map task for each split. Each map task uses a RecordReader provided by the InputFormat subclass to transform the split into input key-value pairs.
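
As a brief illustration, assuming the standard org.apache.hadoop.mapreduce API, the InputFormat is set on the job as shown below. TextInputFormat is in fact Hadoop's default, so this call is usually implicit.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "input format example");
        // TextInputFormat's RecordReader hands each line of a split to the
        // map task as a (byte offset, line text) key-value pair.
        job.setInputFormatClass(TextInputFormat.class);
    }
}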

A (line number, text) key-value pair is generated for each line in an input document. The map function discards the line number and produces a per-line (word, count) pair for each word in the input line. The reduce phase produces (word, count) pairs representing aggregated word counts across all the input documents. Given this input data, the map-reduce progression for the example job is described below.

The output from the map phase contains multiple key-value pairs with the same key: the ‘oats’ and ‘eat’ keys appear twice. Recall that the MapReduce framework consolidates all values with the same key before entering the reduce phase, so the input to reduce is actually (key, value-list) pairs. The full progression therefore runs from the map output, through this consolidation, to the reduced final results.
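
Below is a sketch of the map and reduce steps for this word-count job, written against the standard org.apache.hadoop.mapreduce API. It follows the canonical WordCount pattern, but the exact class names are my own.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: discard the line offset, emit an intermediate (word, 1) pair
    // for each word in the line.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: after the shuffle, receive (word, [1, 1, ...]) and emit the total.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}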

MapReduce Job Life Cycle

Following is the life cycle of a typical MapReduce job and the roles of the primary actors. The full life cycle is more complex, so here we will concentrate on the primary components.

The Hadoop configuration can be done in different ways, but the basic configuration consists of the following:

  • A single master node running the Job Tracker
  • Multiple worker nodes running Task Trackers

Following are the components of the MapReduce job life cycle:

  • Local Job Client: The local Job Client prepares the job for submission and hands it off to the Job Tracker.
  • Job Tracker: The Job Tracker schedules the job and distributes the map work among the Task Trackers for parallel processing.
  • Task Tracker: Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the Task Trackers.

Once map results are available, the Job Tracker distributes the reduce work among the Task Trackers for parallel processing.

Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives progress information from the Task Trackers.

Not all map tasks have to complete before reduce tasks begin running. Reduce tasks can begin as soon as map tasks begin completing. Thus, the map and reduce steps often overlap.
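
The degree of overlap is tunable. A hedged sketch follows; the property name shown is the Hadoop 2.x key (older releases used mapred.reduce.slowstart.completed.maps), and the default in Hadoop 2.x is 0.05.

import org.apache.hadoop.conf.Configuration;

public class SlowstartExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Launch reduce tasks (which begin shuffling map output) once 50%
        // of the map tasks have completed.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.50f);
    }
}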

Functionality of different components in MapReduce job

Job Client: The Job Client performs the following tasks (a driver sketch showing these steps follows the list):

  • Validates the job configuration
  • Generates the input splits; this is basically splitting the input data into chunks
  • Copies the job resources (configuration, job JAR file, input splits) to a shared location, such as an HDFS directory, where it is accessible to the Job Tracker and Task Trackers
  • Submits the job to the Job Tracker
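
Here is a driver sketch that triggers these Job Client steps, reusing the WordCount classes from the earlier example; the job name and the command-line paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);     // job JAR file to ship
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results directory
        // waitForCompletion() validates the configuration, computes the input
        // splits, copies job resources to the shared location, submits the
        // job, then polls for progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}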

Job Tracker: The Job Tracker performs the following tasks:

  • Fetches input splits from the shared location where the Job Client placed the information
  • Creates a map task for each split
  • Assigns each map task to a Task Tracker (worker node)

After the map tasks are complete, the Job Tracker does the following:

  • Creates reduce tasks up to the maximum enabled by the job configuration.
  • Assigns each map result partition to a reduce task.
  • Assigns each reduce task to a Task Tracker.

Task Tracker: A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker.

The Task Tracker does the following when a map or reduce task is assigned to it:

  • Fetches job resources locally
  • Spawns a child JVM on the worker node to execute the map or reduce task
  • Reports status to the Job Tracker

Debugging MapReduce

Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you run Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them. The username in the log filename refers to the username under which Hadoop was started — this is not necessarily the same username you are using to run programs. The service name refers to which of the several Hadoop programs is writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.

The log directory will also have a subdirectory called userlogs. Here there is another subdirectory for every task run. Each task records its stdout and stderr to two files in this directory. Note that on a multi-node Hadoop cluster, these logs are not centrally aggregated; you should check each worker node's logs/userlogs/ directory for its output.

Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally. The default configuration deployed by Hadoop runs in “single instance” mode, where the entire MapReduce program is run in the same instance of Java that called JobClient.runJob(). Using a debugger like Eclipse, you can then set breakpoints inside the map() or reduce() methods to discover your bugs.
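
A hedged configuration sketch for such a local run follows, using Hadoop 1.x property names; Hadoop 2.x uses mapreduce.framework.name=local and fs.defaultFS instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalDebugConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "local"); // run in the submitting JVM
        conf.set("fs.default.name", "file:///"); // read/write the local disk
        // With this configuration, no child JVMs are spawned, so breakpoints
        // set inside map() and reduce() in an IDE such as Eclipse will be hit.
        Job job = Job.getInstance(conf, "local debug run");
    }
}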

Is reduce job mandatory?

Some jobs can complete all their work during the map phase, so the job can be a map-only job. To stop a job after the map phase completes, set the number of reduce tasks to zero, as in the sketch below.
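
A minimal sketch of a map-only job configuration, using the standard Job API; the job name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        // Zero reduce tasks: the job stops after the map phase, and map
        // output is written directly to the output path, with no shuffle
        // or sort.
        job.setNumReduceTasks(0);
    }
}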

Conclusion

This module described the MapReduce execution platform at the heart of the Hadoop system. By using MapReduce, a high level of parallelism can be achieved by applications. The MapReduce framework provides a high level of fault tolerance for applications running on it by limiting the communication that can occur between nodes.
