Apache Pig and Hadoop platform - How to process your data?


Overview: Apache Pig is a high-level scripting platform and part of the Apache Hadoop ecosystem. Pig scripts are mainly used for data analysis and manipulation on top of Hadoop. MapReduce is the programming model Hadoop uses for parallel processing, and Pig uses MapReduce internally to process data in a distributed environment. In effect, Pig provides an abstraction over the MapReduce model that makes programming easier for developers. Its language, Pig Latin, resembles SQL in many of its operations, so developers can write SQL-like statements for data processing without writing MapReduce code directly.


Introduction: The power of Pig comes from its ability to describe a data analysis task as a data flow that moves from one component to the next. Another important feature is User Defined Functions (UDFs), which let Pig scripts call code written in popular high-level languages such as Java, Python, and Ruby. Conversely, Pig scripts can be executed from other languages. So you can use Pig to express complex business logic that must run in parallel on a distributed computing system, and then invoke it as a component from different applications.

Conceptual thinking – How Pig works

The easiest way to understand the workflow of a Pig script is to compare it with an ETL (Extract, Transform, Load) process. In ETL, the data is first extracted from its sources, then processed according to the business logic, and finally stored in a database. A Pig script follows the same pattern. The steps are as follows.

First, Pig extracts data from the sources (streams, flat files, dynamic data etc.) using a UDF. This is the input.
Second, Pig performs its operations (such as select, iterate and other transforms) on the data. This is the initial processing.
Third, Pig passes the data to other, more complex systems for additional processing (again using UDFs). This is the further processing.
Finally, Pig stores the result in the Hadoop Distributed File System (HDFS). This is the storage.
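
The four steps above can be sketched in plain Python to make the flow concrete. This is only an illustration of the idea, not how Pig is implemented: the sample records, the age threshold, and the function names (extract, transform, enrich, store) are all assumptions made up for this sketch, and an in-memory buffer stands in for HDFS.

```python
import csv
import io

# Hypothetical tab-delimited input standing in for a flat-file source.
RAW = "John\t34\nNicholas\t45\nDan\t29\nNeel\t51\n"

def extract(text):
    """Step 1 (input): read raw records from the source."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(name, int(age)) for name, age in reader]

def transform(records):
    """Step 2 (initial processing): filter rows and keep one field."""
    return [name for name, age in records if age >= 30]

def enrich(names):
    """Step 3 (further processing): stands in for a call to a UDF."""
    return [name.upper() for name in names]

def store(rows, sink):
    """Step 4 (storage): write results to the sink (HDFS in a real job)."""
    for row in rows:
        sink.write(row + "\n")

def run_pipeline(text):
    """Chain the four steps and return what was written to the sink."""
    sink = io.StringIO()
    store(enrich(transform(extract(text))), sink)
    return sink.getvalue()
```

Each function consumes the output of the previous one, which is the same "data flow" style in which a Pig script is written.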

Internally, every Pig task is a series of MapReduce jobs that run on a Hadoop cluster. Pig optimizes these jobs before running them to improve performance.

Apache Pig components

The main components of Apache Pig are its infrastructure layer and the language layer.

Infrastructure layer: This layer contains the compiler that turns Pig scripts into a sequence of MapReduce jobs. The jobs then run on a distributed, parallel computing framework.
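
To give a feel for what a generated MapReduce job does, here is a minimal single-machine sketch of the map, shuffle and reduce phases in Python. It is an assumption-laden illustration, not Pig's actual compiler output: the task (counting people per age decade), the sample records, and the function names are all invented for this example.

```python
from collections import defaultdict

# Hypothetical (name, age) records; the ages are made up for illustration.
RECORDS = [("John", 34), ("Nicholas", 45), ("Dan", 29), ("Neel", 31)]

def map_phase(records):
    """Map: emit one (key, value) pair per record, keyed by age decade."""
    for _name, age in records:
        yield (age // 10) * 10, 1

def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_decade(records):
    """Run the three phases in order."""
    return reduce_phase(shuffle(map_phase(records)))
```

On a real cluster the map and reduce phases run in parallel on many machines. With Pig, a developer writes only the high-level statement and leaves this plumbing to the infrastructure layer.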

Language layer: The language layer consists of a textual language known as ‘Pig Latin’. Its syntax resembles SQL in many respects. It has the following features.

Simple programming: It provides a simple way to write scripts that run data analysis tasks in parallel. Complex tasks, including multi-step data transformations, are expressed as a sequence of data flow steps, which makes Pig Latin scripts easy to write, understand and maintain.
Better optimization: Because of the way tasks are encoded, the system can optimize their execution automatically.
Extendable: The language can be extended with custom functions.

Pig – Operators: Pig has many operators to perform its tasks. Two examples are ‘LOAD’ and ‘FOREACH’.

Pig – User Defined Functions: Pig supports User Defined Functions for tasks that the built-in operators do not cover. These functions can be written in Java, among other languages.
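
As a rough illustration of what a UDF can look like, here is a small Python function of the kind Pig can run through its Jython support. The function name first_initial and its schema string are assumptions chosen for this sketch. The fallback definition of outputSchema is there only so that the file can also be run as ordinary Python. When the script is registered with Pig's Jython engine, Pig supplies that decorator itself.

```python
# Pig's Jython engine provides the outputSchema decorator. This fallback
# is used only when the file is run as ordinary Python (e.g. for testing).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema  # remember the declared schema
            return func
        return wrap

@outputSchema("initial:chararray")
def first_initial(name):
    """Return the upper-cased first letter of a name, or None if empty."""
    if name is None or len(name) == 0:
        return None
    return name[0].upper()
```

Assuming the file is saved as udfs.py (a hypothetical name), it could be registered in a Pig script with REGISTER 'udfs.py' USING jython AS myfuncs; and then called as myfuncs.first_initial(name) inside a FOREACH ... GENERATE statement.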

How to install and execute Pig?

In this section we will discuss the installation and execution of Pig scripts. Let’s go through the steps one by one.

Prerequisite: Both UNIX and Windows users need Hadoop and Java installed on their system. HADOOP_HOME and JAVA_HOME should be set properly.
Pig Download: First, download a stable version of Pig. Then unpack the distribution and note the location of the pig script and the Pig properties file. After this, add the ‘bin’ directory to your path as shown below.

[code] $ export PATH=/<path-to-pig>/pig-n.n.n/bin:$PATH [/code]

Now test the Pig installation with the following command. It prints the help text for the pig command. If the help text appears, your Pig installation is working.

[code] $ pig -help [/code]

Run/Execute Pig commands: Pig Latin statements and Pig scripts can be run in either ‘Local’ or ‘MapReduce’ mode. Local mode needs only a single machine, while MapReduce mode needs a Hadoop cluster and an HDFS installation. Which mode you choose depends on the infrastructure available, such as a local installation or a clustered environment. In either mode, Pig can be started in two ways: with the ‘pig’ command (the ‘bin/pig’ script), or with the ‘java’ command, as in ‘java -cp pig.jar’.

Local Mode: To run Pig in local mode, keep all the required files on your local file system and run Pig on the local host.

Listing 1: Showing Pig running in local mode

[code]

# Run Pig in local mode

$ pig -x local

[/code]

MapReduce Mode: For MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. This is the default mode, so you do not need to specify the ‘-x’ flag.

Listing 2: Showing Pig running in mapreduce mode

[code]

# Run Pig in mapreduce mode (this is the default mode)

$ pig

$ pig -x mapreduce

[/code]

In general, Pig can be run in interactive or batch mode. In interactive mode, the ‘Grunt’ shell is used to enter individual Pig Latin statements. In batch mode, the Pig Latin statements are put in a script file with a .pig extension and run from the command line. This is similar to running individual SQL statements versus SQL scripts.
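
Because Pig is started from the command line, it can also be launched from another program. The sketch below builds the command lines shown in the listings above and, optionally, runs them. It is a minimal example under stated assumptions: the helper names pig_command and run_pig are hypothetical, only the two modes covered in this article are accepted, and run_pig assumes the pig executable is on your PATH.

```python
import subprocess

# The two execution modes covered in this article.
MODES = ("local", "mapreduce")

def pig_command(mode="mapreduce", script=None):
    """Build the argument list for starting Pig.

    Without a script, Pig opens the interactive Grunt shell.
    With a script, Pig runs it in batch mode.
    """
    if mode not in MODES:
        raise ValueError("unsupported mode: %s" % mode)
    cmd = ["pig", "-x", mode]
    if script is not None:
        cmd.append(script)
    return cmd

def run_pig(mode="mapreduce", script=None):
    """Start Pig and return its exit code (requires pig on PATH)."""
    return subprocess.call(pig_command(mode, script))
```

For example, pig_command("local", "example.pig") produces the same arguments as typing pig -x local example.pig in a shell.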

How to run Pig Latin statements and Pig script?

In this section we will run some example Pig Latin statements and Pig scripts. In the following example, employee records are loaded from storage, and then the names are extracted and dumped as output.

Listing 3: Showing Pig statements

[code]
grunt> E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);

grunt> N = FOREACH E GENERATE name;

grunt> DUMP N;

[/code]

Following is the output:

(John)

(Nicholas)

(Dan)

(Neel)
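
To see what the three statements do to the data, here is a rough Python equivalent. It only imitates the behaviour and is not how Pig executes the statements. The ages in the sample input are invented, since the article shows only the names, and the function names are made up for this sketch. The tab delimiter matches the default used by PigStorage().

```python
# Hypothetical contents of the 'employees' file (the ages are invented).
SAMPLE = "John\t34\nNicholas\t45\nDan\t29\nNeel\t51\n"

def load(text, delimiter="\t"):
    """Like LOAD ... AS (name:chararray, age:int): parse lines into tuples."""
    relation = []
    for line in text.splitlines():
        name, age = line.split(delimiter)
        relation.append((name, int(age)))
    return relation

def foreach_generate_name(relation):
    """Like FOREACH E GENERATE name: keep only the first field."""
    return [(name,) for name, _age in relation]

def dump(relation):
    """Like DUMP: format each tuple the way the Grunt shell prints it."""
    return ["(" + ",".join(str(field) for field in row) + ")" for row in relation]

E = load(SAMPLE)
N = foreach_generate_name(E)
for line in dump(N):
    print(line)
```

Each variable (E, N) holds a relation, that is, a collection of tuples, and each statement produces a new relation from the previous one.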

Now, the same task can be done by using a script file (example.pig). The code snippet is shown below.

Listing 4: Showing Pig script

[code]

/* example.pig */

E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);

N = FOREACH E GENERATE name;

DUMP N;

/* End of script */

[/code]

Now run the script file in local mode as shown below.

[code]

$ pig -x local example.pig

[/code]

Following is the output:

(John)

(Nicholas)

(Dan)

(Neel)

Conclusion: In this article we have seen that Pig is a powerful scripting platform built on the Hadoop ecosystem and the MapReduce programming model. It can be used to process large volumes of data in a distributed environment. Pig statements and scripts are similar to SQL statements, so developers can use them without focusing much on the underlying mechanism. Hopefully Apache Pig will continue to evolve and support even more efficient computing.
