What are the Hadoop MapReduce concepts?

What do you mean by Map-Reduce programming?

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.

The MapReduce programming model is inspired by functional languages and targets data-intensive computations. The input data format is application-specific and is specified by the user. The output is a set of <key, value> pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied to the input data and produces a list of intermediate <key, value> pairs. The Reduce function is applied to all intermediate pairs that have the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key. In the simplest form of MapReduce programs, the programmer provides only the Map function. All other functionality, including the grouping of the intermediate pairs that share the same key and the final sorting, is provided by the runtime.
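As a concrete illustration, a minimal word count Map and Reduce pair might look like the following sketch, written against Hadoop's org.apache.hadoop.mapreduce Java API (the class names WordCountMapper and WordCountReducer are illustrative, not part of any library):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map function: applied to the input data, emits intermediate <key, value> pairs.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);  // intermediate pair: (word, 1)
            }
        }
    }

    // Reduce function: applied to all intermediate pairs that share the same key.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();  // merge all values for this key
            }
            context.write(word, new IntWritable(sum));  // output pair: (word, total)
        }
    }

The runtime provides the rest: it groups the mapper's (word, 1) pairs by key and hands each group to the reducer.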

Phases of the MapReduce model

The top-level unit of work in MapReduce is a job. A job usually has a map phase and a reduce phase, though the reduce phase can be omitted. For example, consider a MapReduce job that counts the number of times each word is used across a set of documents. The map phase counts the words in each document, then the reduce phase aggregates the per-document data into word counts spanning the entire collection.

During the map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster. By default, the MapReduce framework gets its input data from the Hadoop Distributed File System (HDFS).

The reduce phase uses the results from the map tasks as input to a set of parallel reduce tasks. The reduce tasks consolidate the data into final results. By default, the MapReduce framework stores the results in HDFS.

Although the reduce phase depends on output from the map phase, map and reduce processing is not necessarily sequential. That is, reduce tasks can begin as soon as any map task completes; it is not necessary for all map tasks to finish before any reduce task can begin.

MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through the map and reduce functions. The map tasks produce an intermediate set of key-value pairs that the reduce tasks use as input.

The keys in the map output pairs need not be unique. Between the map processing and the reduce processing, a shuffle step sorts all map output values with the same key into a single reduce input (key, value-list) pair, where the 'value' is a list of all values sharing the same key. Thus, the input to a reduce task is actually a set of (key, value-list) pairs.

Though each set of key-value pairs is homogeneous, the key-value pairs in each step need not have the same type. For example, the key-value pairs in the input set (KV1) can be (string, string) pairs, with the map phase producing (string, integer) pairs as intermediate results (KV2), and the reduce phase producing (integer, string) pairs for the final results (KV3).
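In Hadoop's Java API this progression shows up directly in the generic type parameters of Mapper and Reducer. A minimal sketch matching the KV1 → KV2 → KV3 example above (the class names and the length/sum logic are purely illustrative):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // KV1 = (string, string)  ->  KV2 = (string, integer)
    class ExampleMapper extends Mapper<Text, Text, Text, IntWritable> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit an intermediate (string, integer) pair, e.g. the value's length.
            context.write(key, new IntWritable(value.getLength()));
        }
    }

    // KV2 = (string, integer)  ->  KV3 = (integer, string)
    class ExampleReducer extends Reducer<Text, IntWritable, IntWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            // Final (integer, string) pair: note the key and value types are swapped.
            context.write(new IntWritable(total), key);
        }
    }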

Example demonstrating MapReduce concepts

The example demonstrates basic MapReduce concepts by calculating the number of occurrences of each word in a set of text files.

The MapReduce input data is divided into input splits, and the splits are further divided into input key-value pairs. In this example, the input data set is the two documents, document1 and document2. The InputFormat subclass divides the data set into one split per document, for a total of two splits.

Note: The MapReduce framework divides the input data set into chunks called splits using the org.apache.hadoop.mapreduce.InputFormat subclass supplied in the job configuration. Splits are created by the local Job Client and included in the job information made available to the Job Tracker. The Job Tracker creates a map task for each split. Each map task uses a RecordReader provided by the InputFormat subclass to transform the split into input key-value pairs.
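For example, a job that reads line-oriented text could be configured with TextInputFormat, whose RecordReader presents each split to the map task as (byte offset, line text) pairs. A minimal sketch, assuming a Hadoop 2.x client and an illustrative input path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatConfig {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "input format demo");
            // TextInputFormat creates the splits; its RecordReader then turns each
            // split into (LongWritable byte offset, Text line) input pairs.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... remaining job setup and submission as in a full driver ...
        }
    }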

A (line number, text) key-value pair is generated for each line in an input document. The map function discards the line number and produces a (word, count) pair for each word in the input line. The reduce phase then produces (word, count) pairs representing the aggregated word counts across all the input documents. For the example job's input data, the map-reduce progression is described below.

The output from the map phase contains multiple key-value pairs with the same key: the 'oats' and 'eat' keys each appear twice. Recall that the MapReduce framework consolidates all values sharing the same key before entering the reduce phase, so the input to reduce is actually (key, value-list) pairs. The full progression therefore runs from the map output, through this consolidation, to the final aggregated word counts produced by reduce.

The MapReduce job life cycle

Following is the life cycle of a typical MapReduce job and its primary actors. The full life cycle is more complex than this, so here we will focus on the primary components.

The Hadoop configuration can be done in different ways, but the basic configuration consists of the following:

  • A single master node running the Job Tracker
  • Multiple worker nodes, each running a Task Tracker

Following are the life cycle components of a MapReduce job:

  • Local Job Client: The local Job Client prepares the job for submission and hands it off to the Job Tracker.
  • Job Tracker: The Job Tracker schedules the job and distributes the map work among the Task Trackers for parallel processing.
  • Task Tracker: Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the Task Trackers.

Once map results are available, the Job Tracker distributes the reduce work among the Task Trackers for parallel processing.

Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives progress information from the Task Trackers.
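Putting the life cycle together, a hedged word count driver might look like the sketch below. It reuses the illustrative WordCountMapper and WordCountReducer classes from earlier, submits the job through the job client, and blocks while the framework reports progress:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);  // the job JAR is one of the copied job resources
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Submits the job and polls the framework for map/reduce progress.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }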

Reduce tasks can begin running before all map tasks have completed; a reduce task can start as soon as map tasks begin to finish. Thus, the map and reduce steps often overlap.

Functions of the different components in a MapReduce job

Job Client: The Job Client performs the following tasks:

  • Validates the job configuration
  • Generates the input splits, essentially splitting the job input into chunks
  • Copies the job resources (configuration, job JAR file, input splits) to a shared location, such as an HDFS directory, that is accessible to the Job Tracker and the Task Trackers
  • Submits the job to the Job Tracker

Job Tracker: The Job Tracker performs the following tasks:

  • Fetches the input split information from the shared location where the Job Client placed it
  • Creates a map task for each split
  • Assigns each map task to a Task Tracker (worker node)

After the map tasks are complete, the Job Tracker does the following tasks:

  • Creates reduce tasks up to the maximum enabled by the job configuration.
  • Assigns each map result partition to a reduce task (see the partitioner sketch after this list).
  • Assigns each reduce task to a Task Tracker.
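Which reduce task receives a given map output key is decided by the job's Partitioner. The sketch below mirrors the behavior of Hadoop's default HashPartitioner; the class name WordPartitioner is illustrative:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Every map output pair with the same key hashes to the same partition,
    // so all values for a key end up at a single reduce task.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

A job would opt into it with job.setPartitionerClass(WordPartitioner.class), though the default partitioner behaves the same way.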

Task Tracker: A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker.

The Task Tracker does the following tasks when a map or reduce task is assigned to it:

  • Fetches job resources locally
  • Spawns a child JVM on the worker node to execute the map or reduce task
  • Reports status to the Job Tracker

Debugging MapReduce

Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you run Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them. The username in the log filename refers to the username under which Hadoop was started; this is not necessarily the same username you are using to run programs. The service name refers to which of the several Hadoop programs are writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.

The log directory will also have a subdirectory called userlogs. Here there is another subdirectory for every task run. Each task records its stdout and stderr to two files in this directory. Note that on a multi-node Hadoop cluster, these logs are not centrally aggregated — you should check each TaskNode’s logs/userlogs/ directory for their output.

Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally. The default configuration deployed by Hadoop runs in "single instance" mode, where the entire MapReduce program is run in the same Java instance that called JobClient.runJob(). Using a debugger like Eclipse, you can then set breakpoints inside the map() or reduce() methods to discover your bugs.
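A hedged sketch of such a local, single-JVM run on a Hadoop 2.x client (the configuration keys below are the standard ones from that version; older releases use mapred.job.tracker=local instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalDebugRunner {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Run the whole job in this JVM against the local file system,
            // so an IDE debugger can hit breakpoints inside map() and reduce().
            conf.set("mapreduce.framework.name", "local");
            conf.set("fs.defaultFS", "file:///");

            Job job = Job.getInstance(conf, "local debug run");
            job.setMapperClass(WordCountMapper.class);    // illustrative classes from earlier
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }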

Is reduce job mandatory?

Some jobs can complete all their work during the map phase, so the job can be map-only. To stop a job after the map phase completes, set the number of reduce tasks to zero.
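In the Java API this is a single setting on the job; a minimal sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only job");
            // Zero reduce tasks: the job ends after the map phase and each
            // map task's output is written directly to the job output path.
            job.setNumReduceTasks(0);
            // ... mapper, input, and output configuration as in the earlier sketches ...
        }
    }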

Conclusion

This module described the MapReduce execution platform at the heart of the Hadoop system. By using MapReduce, applications can achieve a high degree of parallelism. The MapReduce framework provides fault tolerance by placing strong restrictions on the communication that can occur between nodes in the applications that run on it.
