How to process your data using Apache Pig?

Overview:

Apache Pig is a platform in the big data ecosystem, used to process large data sets in parallel. It runs on top of Apache Hadoop and MapReduce. MapReduce is the programming model used for Hadoop applications, and Pig provides an abstraction over it to make programming easier: a SQL-like interface for developing MapReduce programs. Instead of writing MapReduce programs directly, developers write a Pig script, which automatically runs in parallel across a distributed environment.

Introduction:

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure for evaluating those programs. The most important property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (for example, in Hadoop).
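To make the compilation target more concrete, here is a minimal plain-Python sketch of the map, shuffle and reduce phases that a MapReduce program goes through. It only illustrates the programming model in a single process; it is not Pig's or Hadoop's actual code. The function names (`map_reduce`, `emit_host_bytes`, `sum_values`) and the sample records are made up for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a single-process imitation of the map, shuffle and reduce phases."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Shuffle phase: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def emit_host_bytes(record):
    """Hypothetical mapper: a record is a (host, bytes_sent) tuple."""
    host, bytes_sent = record
    yield host, bytes_sent


def sum_values(key, values):
    """Hypothetical reducer: total of all values seen for one key."""
    return sum(values)


# Made-up sample input, standing in for a large distributed log file.
log = [("a.example", 120), ("b.example", 30), ("a.example", 80)]
totals = map_reduce(log, emit_host_bytes, sum_values)
print(totals)
```

On a real cluster the map and reduce phases run on many machines at once, which is the part Pig and Hadoop take care of for us.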

The language layer of Pig consists of a textual language called Pig Latin. It has the following key features:

Ease of programming: It is trivial to achieve parallel execution of simple, inherently parallel data analysis tasks. Complex tasks made up of multiple interrelated data transformations are explicitly encoded as data flow sequences. As a result, Pig Latin scripts are easy to write, understand and maintain.
Optimization: The tasks are encoded in a way that permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: We can create our own functions to do special-purpose processing.

Pig Installation and Execution:

Apache Pig can be downloaded from the official website, http://pig.apache.org. It is usually distributed as an archive file: we just need to extract the archive and set the environment variables. Pig can also be installed from an RPM package on Red Hat systems or from a deb package on Debian systems. Once the installation is done, we start Pig in local mode with the following command:

Listing 1: Starting Pig in local mode

$ pig -x local

….

grunt>

Executing this command gives us the grunt shell, where we can enter and execute Pig statements interactively.

A sample Pig script that computes a word count is shown below:

Listing 2: Sample pig script

input_lines = LOAD '/tmp/myLocalCopyOfMyWebsite' AS (line:chararray);

-- Extract the words from each line into a Pig bag,
-- then flatten the bag to get one word per row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Filter out any words that are just white space
filtered_words = FILTER words BY word MATCHES '\\w+';

-- Create a group for each word
word_groups = GROUP filtered_words BY word;

-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/numberOfWordsInMyWebsite';

The above script generates tasks that can be executed in parallel and distributed across multiple machines in a Hadoop cluster, so it can count the words in a data set as large as "all the web pages on the internet".
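The data flow of the word-count script can also be traced step by step in plain Python. The sketch below is only an in-memory illustration of what each statement computes, not how Pig executes it. The function name `word_count` and the sample lines are made up, and `str.split()` is a simplification, since Pig's TOKENIZE also splits on a few punctuation characters.

```python
import re
from collections import Counter


def word_count(lines):
    """Mimic the steps of the word-count script on an in-memory list of lines."""
    # FOREACH ... FLATTEN(TOKENIZE(line)): one word per row.
    # Simplification: split on white space only.
    words = [word for line in lines for word in line.split()]

    # FILTER ... MATCHES '\\w+': keep rows made up entirely of word characters.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

    # GROUP ... BY word, then COUNT: number of rows per distinct word.
    counts = Counter(filtered_words)

    # ORDER ... BY count DESC: largest counts first.
    rows = [(count, word) for word, count in counts.items()]
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Made-up sample input.
sample = ["the pig and the elephant", "the pig"]
for count, word in word_count(sample):
    print(count, word)
```

Each intermediate list plays the role of one relation in the script, which is why a Pig script reads as a sequence of named transformations.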

Pig in MapReduce:

To use Pig in MapReduce mode, we should first make sure that Hadoop is up and running. This can be checked by executing the following command at the shell prompt:

Listing 3: Checking Hadoop Availability

$ hadoop dfs -ls /

Found 3 items

drwxrwxrwx - hue supergroup 0 2011-12-08 05:20 /tmp

drwxr-xr-x - hue supergroup 0 2011-12-08 05:20 /user

drwxr-xr-x - mapred supergroup 0 2011-12-08 05:20 /var

If Hadoop is up and running, this command lists one or more entries. Now that we have confirmed that Hadoop is running, let's check Pig. We start Pig again, this time without the -x local option, so that it runs in MapReduce mode (the default) and connects to Hadoop.

Listing 4: Testing Pig with Hadoop

$ pig

2013-12-06 06:39:44,276 [main] INFO org.apache.pig.Main - Logging error messages to…

2013-12-06 06:39:44,601 [main] INFO org.apache.pig….Connecting to hadoop file \

system at: hdfs://0.0.0.0:8020

2013-12-06 06:39:44,988 [main] INFO org.apache.pig…. connecting to map-reduce \

job tracker at: 0.0.0.0:8021

grunt> cd hdfs:///

grunt> ls

hdfs://0.0.0.0/tmp <dir>

hdfs://0.0.0.0/user <dir>

hdfs://0.0.0.0/var <dir>

grunt>

So now we can see the Hadoop file system from within Pig. Next, let's read some data into it from the local file system. To do this, we first copy a file from the local file system into HDFS using Pig.

Listing 5: Getting the test data

grunt> mkdir tomcatwebFol

grunt> cd tomcatwebFol

grunt> copyFromLocal /usr/share/apache-tomcat/webapps/MywebApp/WEB-INF/web.xml webXMLFile

grunt> ls

hdfs://0.0.0.0/tomcatwebFol/webXMLFile <r 1> 10,924

With this sample data now in Hadoop's file system, we can try another script. For example, we can print the file's contents from within Pig, much like cat. To do so, we load webXMLFile from HDFS into a Pig relation.

Listing 6: Load and parse the file

grunt> webXMLFile = LOAD '/tomcatwebFol/webXMLFile' USING PigStorage('>') AS (context_param:chararray, param_name:chararray, param_value:chararray);

grunt> DUMP webXMLFile;

(RootDir, /usr/Oracle/AutoVueIntegrationSDK/FileSys/Repository/filesysRepository)…

grunt>
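The role of the delimiter passed to PigStorage can be illustrated with a small Python sketch: each line of the file is split on the delimiter and the pieces are matched, in order, with the field names of the schema. The function `split_record` and the field names are hypothetical, and the handling of short and long lines (padding with null, ignoring extras) reflects my understanding of Pig's behavior rather than anything shown in the listing.

```python
def split_record(line, delimiter, field_names):
    """Split one text line on a delimiter and pair the pieces with field names."""
    values = line.split(delimiter)
    # Pad with None when the line has fewer fields than the schema,
    # and ignore any fields beyond the schema.
    values = (values + [None] * len(field_names))[:len(field_names)]
    return dict(zip(field_names, values))


# Hypothetical three-field schema and made-up input lines.
schema = ["first", "second", "third"]
print(split_record("a>b>c", ">", schema))
print(split_record("a>b", ">", schema))
```

Choosing a delimiter that really separates the fields of the input matters, because everything downstream works on the fields produced by this split.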

Pig also provides the GROUP operator, which groups tuples based on the value of a chosen field.

Operators in Pig:

Apache Pig has a number of relational and diagnostic operators. The most important ones are listed in the table below:

Operator Name | Type | Description
FILTER | Relational | Select a set of tuples from a relation based on a condition.
FOREACH | Relational | Iterate over the tuples of a relation, generating a data transformation.
GROUP | Relational | Group the data in one or more relations.
JOIN | Relational | Join two or more relations (inner or outer join).
LOAD | Relational | Load data from the file system.
ORDER | Relational | Sort a relation based on one or more fields.
SPLIT | Relational | Partition a relation into two or more relations.
STORE | Relational | Store data in the file system.
DESCRIBE | Diagnostic | Return the schema of a relation.
DUMP | Diagnostic | Dump the contents of a relation to the screen.
EXPLAIN | Diagnostic | Display the MapReduce execution plans.
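To give a feel for what a few of the relational operators compute, here is a plain-Python sketch that treats a relation as a list of tuples. It illustrates the semantics only, not Pig syntax or Pig's execution; the relations `users` and `visits` and their contents are made up.

```python
# Two made-up relations, each a list of tuples.
users = [("alice", 34), ("bob", 17), ("carol", 52)]  # (name, age)
visits = [("alice", "/home"), ("carol", "/docs"), ("alice", "/faq")]  # (name, page)

# FILTER: keep the tuples that satisfy a condition.
adults = [u for u in users if u[1] >= 18]

# SPLIT: route each tuple by condition; a tuple may land in
# several output relations or in none of them.
minors = [u for u in users if u[1] < 18]
seniors = [u for u in users if u[1] >= 50]

# JOIN (inner): pair up tuples from both relations that share a key.
# The result keeps the fields of both sides, including both key fields.
joined = [u + v for u in users for v in visits if u[0] == v[0]]

print(adults)
print(minors, seniors)
print(joined)
```

In Pig each of these is a single statement, and the same statement works unchanged whether the relation holds three tuples or billions.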

User Defined Functions:

Pig is already a powerful and useful scripting tool in the contexts shown in this article, and user-defined functions (UDFs) make it more powerful still. Pig scripts can use functions that we define ourselves for tasks such as parsing input data, formatting output data, and even implementing operators. These UDFs are written in Java and allow Pig to support custom processing.
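The idea behind a UDF, a custom function applied to a field of every tuple, can be sketched in a few lines. The example below is a conceptual illustration in plain Python, not a real Pig UDF and not the Java API mentioned above; `normalize_word` and `foreach_generate` are hypothetical names, and the rows are made up.

```python
def normalize_word(word):
    """Hypothetical custom function: strip surrounding punctuation and lower-case."""
    return word.strip(".,;:!?\"'()").lower()


def foreach_generate(relation, func):
    """Apply func to the first field of every tuple, keeping the other fields."""
    return [(func(row[0]),) + row[1:] for row in relation]


# Made-up relation of (word, count) tuples.
rows = [("Pig,", 1), ("HADOOP", 2)]
print(foreach_generate(rows, normalize_word))
```

In Pig, the custom function would be registered with the script and then called inside a FOREACH statement in the same way as a built-in function.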

Summary:

Let us conclude our discussion with the following set of bullets:

Apache Pig is a part of the big data ecosystem.
Apache Pig is a platform for analyzing large data sets; it includes a high-level language for expressing data analysis programs.
Apache Pig can be downloaded and installed from the official website, http://pig.apache.org.
It is easy to configure and runs on top of Hadoop, working with data stored in the Hadoop Distributed File System.

Hope you have enjoyed the article. Keep reading!!


TechAlpine – All About Technology

www.techalpine.com