What are the Hadoop MapReduce concepts?

What do you mean by Map-Reduce programming?

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.

The MapReduce programming model is inspired by functional languages and targets data-intensive computations. The input data format is application-specific and is specified by the user. The output is a set of <key, value> pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied to the input data and produces a list of intermediate <key, value> pairs. The Reduce function is applied to all intermediate pairs that share the same key. It typically performs some kind of merge operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including grouping the intermediate pairs that share the same key and the final sorting, is provided by the runtime.
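Expressed in Hadoop's Java API, this contract looks roughly as follows. This is only a sketch of the abstract shape: the Mapper and Reducer base classes are from org.apache.hadoop.mapreduce, while the class names and type parameters here are placeholders.

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: applied to the input data, emits a list of intermediate (key, value) pairs.
class MyMapper<KI, VI, KO, VO> extends Mapper<KI, VI, KO, VO> {
    @Override
    protected void map(KI key, VI value, Context context)
            throws IOException, InterruptedException {
        // context.write(...) emits zero or more intermediate pairs
    }
}

// Reduce: applied once per distinct intermediate key, merges the values
// that share that key and emits zero or more output pairs.
class MyReducer<KI, VI, KO, VO> extends Reducer<KI, VI, KO, VO> {
    @Override
    protected void reduce(KI key, Iterable<VI> values, Context context)
            throws IOException, InterruptedException {
        // context.write(...) emits the merged result(s)
    }
}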

Phases of the MapReduce model

The top-level unit of work in MapReduce is a job. A job usually has a map phase and a reduce phase, though the reduce phase can be omitted. For example, consider a MapReduce job that counts the number of times each word is used across a set of documents. The map phase counts the words in each document, and the reduce phase then aggregates the per-document counts into word counts spanning the entire collection.

During the map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster. By default, the MapReduce framework reads its input data from the Hadoop Distributed File System (HDFS).

The reduce phase uses the results from the map tasks as input to a set of parallel reduce tasks. The reduce tasks consolidate the data into the final results. By default, the MapReduce framework stores its results in HDFS.

Although the reduce phase depends on output from the map phase, map and reduce processing is not necessarily sequential. That is, reduce tasks can begin as soon as any map task completes. It is not necessary for all map tasks to complete before any reduce task can begin.
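How early reduce tasks may be scheduled is tunable. A hedged sketch, assuming the Hadoop 2.x property name (the MRv1-era equivalent was mapred.reduce.slowstart.completed.maps); the value shown is only illustrative.

import org.apache.hadoop.conf.Configuration;

public class SlowstartExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fraction of map tasks that must finish before reduce tasks are scheduled.
        // 0.5 is purely illustrative; the shipped default is much lower (around 0.05).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.5f);
    }
}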

MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through the map and reduce functions. The map tasks produce an intermediate set of key-value pairs that the reduce tasks use as input.

The keys in the map output pairs need not be unique. Between the map processing and the reduce processing, a shuffle step sorts all map output values with the same key into a single reduce input (key, value-list) pair, where the ‘value’ is a list of all values sharing the same key. Thus, the input to a reduce task is actually a set of (key, value-list) pairs.

Though each set of key-value pairs is homogeneous, the key-value pairs in each step need not have the same type. For example, the key-value pairs in the input set (KV1) can be (string, string) pairs, with the map phase producing (string, integer) pairs as intermediate results (KV2), and the reduce phase producing (integer, string) pairs for the final results (KV3).
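In the Java API these pair types are declared on the job. A minimal sketch, assuming the hypothetical (string, string) → (string, integer) → (integer, string) progression above mapped onto Hadoop's Writable types and the newer org.apache.hadoop.mapreduce API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeConfigExample {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance();
        // KV2: the intermediate (string, integer) pairs emitted by the map tasks
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // KV3: the final (integer, string) pairs produced by the reduce tasks
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        // KV1, the input pair type, is determined by the job's InputFormat.
    }
}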


Example demonstrating MapReduce concepts

The example demonstrates the basic MapReduce concepts by calculating the number of occurrences of each word in a set of text files.

The MapReduce input data is divided into input splits, and the splits are further divided into input key-value pairs. In this example, the input data set is the two documents, document1 and document2. The InputFormat subclass divides the data set into one split per document, for a total of 2 splits:

Note: The MapReduce framework divides the input data set into chunks called splits using the org.apache.hadoop.mapreduce.InputFormat subclass supplied in the job configuration. Splits are created by the local Job Client and included in the job information made available to the Job Tracker. The Job Tracker creates a map task for each split. Each map task uses a RecordReader provided by the InputFormat subclass to transform the split into input key-value pairs.
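For plain text input the InputFormat is typically TextInputFormat, whose RecordReader turns each line into a (byte offset, line text) pair. A minimal sketch of declaring it in the job configuration, using the org.apache.hadoop.mapreduce API; the input path is a placeholder.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance();
        // TextInputFormat splits the input files and supplies a RecordReader
        // that emits one (LongWritable offset, Text line) pair per input line.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input/documents"));
    }
}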

A (line number, text) key-value pair is generated for each line in an input document. The map function discards the line number and produces a per-line (word, count) pair for each word in the input line. The reduce phase produces (word, count) pairs representing the aggregated word counts across all the input documents. The progression from input data through the map and reduce phases for the example job is as follows:

The output of the map phase contains multiple key-value pairs with the same key: the 'oats' and 'eat' keys each appear twice. Recall that the MapReduce framework consolidates all values with the same key before the reduce phase begins, so the input to reduce is actually (key, values) pairs. Thus, the complete progression from map output, through reduce, to the final results is shown above.
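A minimal sketch of the mapper and reducer for this word-count example, written against the org.apache.hadoop.mapreduce API; the class names are illustrative and error handling is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: discard the line offset and emit (word, 1) for every word in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: the shuffle has grouped all counts for a word into one value list;
// sum them to get the aggregate count across all input documents.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}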

MapReduce Job Life Cycle

The following describes the life cycle of a typical MapReduce job and the roles of the primary actors. The full life cycle is more complex, so here we focus on the primary components.

The Hadoop configuration can be done in different ways, but the basic configuration consists of the following (a minimal configuration sketch follows the list).

  • Single master node running Job Tracker
  • Multiple worker nodes running Task Tracker
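In classic (MRv1) deployments, the address of the master node running the Job Tracker is normally declared in mapred-site.xml; the same property can also be set programmatically. A hedged sketch, with the host name and port as placeholders for an actual cluster address:

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Classic MRv1 property naming the single master node running the Job Tracker;
        // the host and port here are placeholders.
        conf.set("mapred.job.tracker", "master-host:8021");
        // Task Trackers on the worker nodes report to this Job Tracker.
    }
}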

Following are the life cycle components of a MapReduce job.

  • Local Job Client: The local Job Client prepares the job for submission and hands it off to the Job Tracker.
  • Job Tracker: The Job Tracker schedules the job and distributes the map work among the Task Trackers for parallel processing.
  • Task Tracker: Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the Task Trackers.

Once map results are available, the Job Tracker distributes the reduce work among the Task Trackers for parallel processing.

Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives progress information from the Task Trackers.

All map tasks do not have to complete before reduce tasks begin running. Reduce tasks can begin as soon as map tasks begin completing. Thus, the map and reduce steps often overlap.

Functionality of different components in a MapReduce job

Job Client: The Job Client performs the following tasks (a driver sketch showing job submission follows the list):

  • Validates the job configuration
  • Generates the input splits; this is basically splitting the input data into chunks
  • Copies the job resources (configuration, job JAR file, input splits) to a shared location, such as an HDFS directory, where they are accessible to the Job Tracker and Task Trackers
  • Submits the job to the Job Tracker
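From application code, this submission step corresponds to configuring a Job and calling submit() or waitForCompletion(). A minimal driver sketch, assuming the hypothetical WordCountMapper and WordCountReducer classes from the earlier example and placeholder HDFS paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);       // job JAR copied to the shared location
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input/documents"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount")); // placeholder path
        // Validates the configuration, computes the splits, copies the job
        // resources to the shared location, and submits the job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}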

Job Tracker: The Job Tracker performs the following tasks:

  • Fetches input splits from the shared location where the Job Client placed the information
  • Creates a map task for each split
  • Assigns each map task to a Task Tracker (worker node)

After the map tasks are complete, the Job Tracker does the following (see the sketch after this list):

  • Creates reduce tasks up to the maximum enabled by the job configuration.
  • Assigns each map result partition to a reduce task.
  • Assigns each reduce task to a Task Tracker.
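Both the number of reduce tasks and the way map output partitions are assigned to them are controlled by the job configuration. A hedged sketch using the standard HashPartitioner from org.apache.hadoop.mapreduce.lib.partition (the value 4 is purely illustrative):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ReduceConfigExample {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance();
        // Maximum number of reduce tasks the Job Tracker may create for this job.
        job.setNumReduceTasks(4);
        // HashPartitioner (the default) decides which reduce task each
        // intermediate (key, value) pair is sent to, based on the key's hash.
        job.setPartitionerClass(HashPartitioner.class);
    }
}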

Task Tracker: A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker.

The Task Tracker does the following when a map or reduce task is assigned to it (a sketch of the child JVM configuration follows the list):

  • Fetches job resources locally
  • Spawns a child JVM on the worker node to execute the map or reduce task
  • Reports status to the Job Tracker
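The resources given to each spawned child JVM can be tuned through the job configuration. A hedged sketch using the classic MRv1 property name; the heap size shown is only an example.

import org.apache.hadoop.conf.Configuration;

public class ChildJvmExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // JVM options passed to the child process that the Task Tracker spawns
        // for each map or reduce task (classic MRv1 property; 512 MB is illustrative).
        conf.set("mapred.child.java.opts", "-Xmx512m");
    }
}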

Debugging Map Reduce

Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you run Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them. The username in the log filename refers to the username under which Hadoop was started — this is not necessarily the same username you are using to run programs. The service name refers to which of the several Hadoop programs are writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.

The log directory will also have a subdirectory called userlogs. Here there is another subdirectory for every task run. Each task records its stdout and stderr to two files in this directory. Note that on a multi-node Hadoop cluster, these logs are not centrally aggregated — you should check each TaskNode’s logs/userlogs/ directory for their output.

Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally. The default configuration deployed by Hadoop runs in "single instance" mode, where the entire MapReduce program is run in the same instance of Java that called JobClient.runJob(). Using a debugger like Eclipse, you can then set breakpoints inside the map() or reduce() methods to discover your bugs.
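Local single-JVM execution can also be requested explicitly. A hedged sketch assuming Hadoop 2.x property names (in classic MRv1 the equivalent was setting mapred.job.tracker to "local"); this is one way to enable it, not the only one.

import org.apache.hadoop.conf.Configuration;

public class LocalDebugConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Run the whole MapReduce pipeline inside the current JVM so that
        // breakpoints in map() and reduce() are hit by the attached debugger.
        conf.set("mapreduce.framework.name", "local");
        // Read input from and write output to the local file system instead of HDFS.
        conf.set("fs.defaultFS", "file:///");
    }
}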

Is reduce job mandatory?

Some jobs can complete all their work during the map phase, so the job can be map-only. To end a job after the map phase completes, set the number of reduce tasks to zero.
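A minimal sketch of how a map-only job is declared; everything other than setNumReduceTasks(0) is configured as in an ordinary job.

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobExample {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance();
        // Zero reduce tasks: the job finishes after the map phase and the
        // map output is written directly as the final result.
        job.setNumReduceTasks(0);
    }
}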

Conclusion

This module described the MapReduce execution platform at the heart of the Hadoop system. By using MapReduce, applications can achieve a high degree of parallelism. The MapReduce framework provides a high degree of fault tolerance for the applications that run on it by limiting the communication that can occur between nodes.
