What are the advanced Hadoop MapReduce features?

Basic MapReduce programming explains the workflow, but it does not cover the actual working details inside the MapReduce framework. This article explains the data movement through the MapReduce architecture and the API calls used to do the actual processing. We will also discuss customization techniques and method overriding for application-specific needs.

The advanced MapReduce features describe the execution and lower-level details. In normal MapReduce programming, knowing the APIs and their usage is sufficient to write applications, but understanding the inner details of MapReduce is essential to know how the framework actually works and to write applications with confidence.

Now let us discuss advanced features in the following sections.

Custom Types (Data): For the user-provided Mapper and Reducer, the Hadoop MapReduce framework always uses typed data. The data that passes through Mappers and Reducers is stored in Java objects.

  • Writable Interface: The Writable interface is one of the most important interfaces. Objects that can be marshalled to/from files and across the network use this interface. Hadoop also uses this interface to transmit data in a serialized form. Some of the classes that implement the Writable interface are mentioned below:
  1. Text class (it stores String data)
  2. LongWritable
  3. FloatWritable
  4. IntWritable
  5. BooleanWritable

Custom data types can also be created by implementing the Writable interface. Hadoop can transmit any custom data type (whatever fits your requirement) as long as it implements the Writable interface.

The Writable interface has the following two methods, readFields and write. The first method (readFields) initializes the object's data from the data contained in the 'in' binary stream. The second method (write) serializes the object to the binary stream 'out'. The most important contract of the entire process is that the order of reading and writing to the binary stream is the same.

Listing 1: Showing Writable interface

public interface Writable {

void readFields(DataInput in);

void write(DataOutput out);

}
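
As an illustration, the following is a minimal sketch of a custom value type implementing Writable. The class name and fields (MeasurementWritable, timestamp, value) are hypothetical and not part of the original listing; the point is that write() and readFields() handle the fields in exactly the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom value type holding a pair of measurements.
public class MeasurementWritable implements Writable {

    private long timestamp;
    private float value;

    public MeasurementWritable() { }              // no-arg constructor required by the framework

    public MeasurementWritable(long timestamp, float value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);                 // serialize to the binary stream 'out'
        out.writeFloat(value);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();                // deserialize from the binary stream 'in'
        value = in.readFloat();                   // same order as write()
    }
}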

Custom Types (Keys): In the previous section we discussed custom data types to meet application-specific data requirements, but that only covers the value part. Now we will also discuss custom key types. In Hadoop MapReduce, the Reducer processes keys in sorted order, so a custom key type needs to implement the interface called WritableComparable. The key type should also implement hashCode().

Following is shown the WritableComparable interface. It represents a Writable which is also Comparable.

Listing 2: Showing WritableComparable interface

public interface WritableComparable<T>

extends Writable, Comparable<T>
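
Building on the previous sketch, here is a hedged example of a custom key type implementing WritableComparable. The EventKey class and its fields are invented for illustration; what matters is that it provides write(), readFields(), compareTo() (which defines the sort order seen by the Reducer) and hashCode() (used by the default partitioner).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: sorted by userId, then by timestamp.
public class EventKey implements WritableComparable<EventKey> {

    private int userId;
    private long timestamp;

    public EventKey() { }

    public EventKey(int userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) {        // defines the sort order seen by the Reducer
        int cmp = Integer.compare(userId, other.userId);
        return (cmp != 0) ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {                       // used by the default HashPartitioner
        return 31 * userId + (int) (timestamp ^ (timestamp >>> 32));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventKey)) return false;
        EventKey k = (EventKey) o;
        return userId == k.userId && timestamp == k.timestamp;
    }
}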

How to use Custom Types: We have already discussed the custom value and key types that Hadoop can process. Now we will discuss the mechanism by which Hadoop understands them. The JobConf object (which defines the job) has two methods called setOutputKeyClass() and setOutputValueClass(), and these methods are used to declare the key and value data types. If the Mapper produces different types that do not match the Reducer, then JobConf's setMapOutputKeyClass() and setMapOutputValueClass() methods can be used to set the intermediate types as expected by the Reducer.
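
Assuming the hypothetical MeasurementWritable and EventKey types sketched above, the job wiring could look roughly like this (classic org.apache.hadoop.mapred API):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class TypeConfigDemo {
    public static JobConf configureTypes(JobConf conf) {
        // Final (Reducer) output types.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(MeasurementWritable.class);   // hypothetical type from the sketch above

        // Intermediate (Mapper) output types, needed only when they
        // differ from the final output types.
        conf.setMapOutputKeyClass(EventKey.class);              // hypothetical key type from the sketch above
        conf.setMapOutputValueClass(IntWritable.class);
        return conf;
    }
}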

Faster Performance: The default sorting process is a bit slow as it first reads the key type from a stream, then parses the byte stream (using the readFields() method) and finally calls the compareTo() method of the key class. The faster approach is to decide an ordering between the keys by checking the byte streams without parsing the entire data set. To implement this faster comparison mechanism, the WritableComparator class can be extended with a comparator specific to your data types. Following is the class declaration.

Listing 3: Showing WritableComparator class

public class WritableComparator

extends Object

implements RawComparator
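
For the hypothetical EventKey sketched earlier (an int followed by a long), a raw byte-level comparator might look like the following. It extends WritableComparator and orders keys directly from the serialized bytes, without calling readFields(). Registering it via WritableComparator.define() (here done in a static block) tells the framework to use it whenever EventKey is the map output key.

import org.apache.hadoop.io.WritableComparator;

// Raw comparator for the hypothetical EventKey (4-byte userId + 8-byte timestamp).
public class EventKeyComparator extends WritableComparator {

    public EventKeyComparator() {
        super(EventKey.class);                   // compare raw bytes, no key instances needed
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int user1 = readInt(b1, s1);             // first 4 bytes: userId
        int user2 = readInt(b2, s2);
        if (user1 != user2) {
            return (user1 < user2) ? -1 : 1;
        }
        long t1 = readLong(b1, s1 + 4);          // next 8 bytes: timestamp
        long t2 = readLong(b2, s2 + 4);
        return (t1 < t2) ? -1 : (t1 == t2 ? 0 : 1);
    }

    static {
        // Register this comparator for the EventKey type.
        WritableComparator.define(EventKey.class, new EventKeyComparator());
    }
}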

So custom data and key types allow the use of higher-level data structures in the Hadoop framework. In a practical Hadoop application, custom data types are one of the most important requirements. This feature allows the use of custom Writable types and provides a significant performance improvement.

Input Formats: The InputFormat is one of the most important interfaces which defines the input specification of a MapReduce job. Hadoop offers different types of InputFormat for interpreting various types of input data. The most common, and the default, is TextInputFormat, which is used to read lines from a text file. Similarly, SequenceFileInputFormat is used to read binary file formats.

The fundamental task of InputFormat is to read the data from the input file. Implementation of a custom InputFormat is also possible as per your application's needs. In the default TextInputFormat implementation, the key is the byte offset of the line and the value is the content of the line terminated by the '\n' character. In a custom implementation, the separator can be any character and the InputFormat will parse accordingly.

The other job of InputFormat is to split the input file (data source) into fragments which are the input to map tasks. These fragments/splits are encapsulated in instances of the InputSplit interface. The input data source can be anything, such as a database table, an XML file or some other file, so the split is performed based on the application requirement. The most important point is that the split operation should be fast and cheap.

After splitting the files, the read operation from the individual splits is very important. The RecordReader is responsible for reading the data from the splits. The RecordReader should be efficient enough to handle the fact that the splits do not always end neatly at the end of a line. The RecordReader always reads till the end of the line even if it crosses the theoretical end of a split. This feature is very important to avoid missing records which might have crossed the InputSplit boundaries.

  • Custom InputFormat: In basic applications InputFormat is used directly, but the best way to customize reading is to subclass FileInputFormat. This abstract class provides the functionality to manipulate files as per the application requirement. For custom parsing, the getRecordReader() method must be overridden to return an instance of RecordReader. This RecordReader is then responsible for reading and parsing; a minimal sketch is shown after this list.
  • Alternate Sources (Data): The InputFormat describes two things: first, the presentation of data to the Mapper, and second, the data source. Most of the implementations are based on FileInputFormat, where the data source is the local file system or HDFS (Hadoop Distributed File System). But for other types of data sources, a custom implementation of InputFormat is required. For example, a NoSQL database such as HBase provides TableInputFormat to read data from database tables. So the data source can be anything that can be handled by a custom implementation.
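
As an example (using the classic org.apache.hadoop.mapred API), a custom InputFormat can subclass FileInputFormat and override getRecordReader(). The sketch below simply delegates to the stock LineRecordReader; a real custom format would supply its own RecordReader that parses each line into application-specific types. The class name MyLineInputFormat is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Minimal custom InputFormat that reuses the built-in line reader.
public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        // Delegate reading/parsing to the stock RecordReader for text lines.
        return new LineRecordReader(job, (FileSplit) split);
    }
}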

Output Formats: The OutputFormat is responsible for the write operation. We have already discussed that the InputFormat and RecordReader interfaces are responsible for reading data into a MapReduce program. After processing the data, the write operation to the permanent storage is managed by the OutputFormat and RecordWriter interfaces. The default format is TextOutputFormat, which writes the key/value pairs as strings to the output file. The other output format is SequenceFileOutputFormat, which keeps the data in binary form. All these classes use the write() and readFields() methods of the Writable classes.

The OutputFormat implementation needs to be customized to write data in a custom format. The FileOutputFormat abstract class must be extended to make the customization, and the JobConf.setOutputFormat() method must be called to use the custom format.
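
A hedged sketch of such a customization, using the classic API, might look like the following. The PipeSeparatedOutputFormat name and the "value|key" line layout are invented for illustration; the essential parts are extending FileOutputFormat and returning a RecordWriter from getRecordWriter(). The job would then be pointed at it with conf.setOutputFormat(PipeSeparatedOutputFormat.class).

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Illustrative custom OutputFormat writing each record as "value|key".
public class PipeSeparatedOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(
            FileSystem ignored, JobConf job, String name, Progressable progress)
            throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final FSDataOutputStream out = fs.create(file, progress);

        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(value.toString() + "|" + key.toString() + "\n");
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}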

Data Partitioning: Partitioning can be defined as the process that determines which Reducer instance will receive which intermediate key/value pair. Each Mapper should determine the destination Reducer for all of its output key/value pairs. The most important point is that, for any key, regardless of its Mapper instance, the destination partition is the same. For performance reasons, Mappers never communicate with each other to determine the partition of a particular key.

The Partitioner interface is used by the Hadoop system to determine the destination partition for a key/value pair. The number of partitions should match the number of reduce tasks. The MapReduce framework determines the number of partitions when a job starts.

Following is the signature of the Partitioner interface.

Listing 4: Showing Partitioner interface

public interface Partitioner<K2,V2>

extends JobConfigurable
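
As an illustration, a custom Partitioner (classic API) could route keys by their first character so that all keys starting with the same character go to the same Reducer. The FirstCharPartitioner name and the routing rule are assumptions for this sketch; it would be registered with conf.setPartitionerClass(FirstCharPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: keys sharing a first character share a partition.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // No configuration needed for this sketch.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Mask the sign bit so the result is always a valid partition index.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}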

Conclusion: In this discussion we have covered the most important Hadoop MapReduce features. These features are helpful for customization. In practical MapReduce applications the default implementations of the APIs are rarely used as-is; rather, the custom features (which are built on the exposed APIs) have a significant impact. All these customizations can be done easily once the concepts are clear. We hope this article is helpful for understanding the advanced features and their implementation.

 
