In today’s fast-paced world, organizations gather enormous amounts of data posted online. Popular services such as Facebook, Instagram, and email providers rely on ‘Big Data’ technology to store and analyze this data for later use.
Big data gets the support of an innovative framework called ‘Hadoop’ to meet its requirements. The Hadoop framework also gives users the flexibility to write processing logic in other languages like C, C++, and Python.
Programmers who are well-versed in scripting tend to use Pig, while those with an SQL background prefer Hive. Nowadays, many people employ Pig to reap its features and benefits in the data manipulation process. Go through this article to learn the essential information about what Apache Pig is.
Pig is a high-level scripting platform commonly used with Apache Hadoop to analyze large data sets. The Pig platform offers a scripting language known as Pig Latin to developers who are already familiar with other scripting languages and with query languages like SQL.
The major benefit of Pig is that it works with data obtained from various sources and stores the results in HDFS (Hadoop Distributed File System). Programmers write scripts in the Pig Latin language, which are then converted into Map and Reduce tasks by a component called the Pig Engine: it accepts Pig Latin scripts and turns them into MapReduce jobs.
MapReduce is a programming model widely used for processing large amounts of data. The MapReduce algorithm consists of two tasks: Map and Reduce. The Map task takes a data set and converts it into another set of data, in which individual elements are broken down into key/value pairs called tuples.
Apache Pig was developed at Yahoo in 2006 to create and manipulate MapReduce tasks on data sets. It was open-sourced through the Apache Incubator in 2007, had its first release in 2008, and became a top-level Apache project in 2010.
Generally, Apache Pig provides an abstraction that reduces the complexity of writing MapReduce programs. A common reason for using Pig is that it lets developers express a job in a few short statements.
The advanced features of Apache Pig let programmers accomplish more with less effort than other frameworks, and they ease the life of a data engineer maintaining various ad hoc queries on data sets. In short, Apache Pig is a boon for programmers and is widely recommended for data management.
As said before, Apache Pig is mainly used to analyze huge data sets and to represent them as data flows. The programming features of Pig yield several advantages to its users. Here are the major advantages of using Pig on data sets.
Ease of Programming – Pig Latin is similar to SQL, so it is simple to pick up for programmers who are experts in SQL.
Helpful for Programmers – Programmers who are less knowledgeable in Java face many difficulties in Hadoop. Apache Pig lets them handle various tasks, especially MapReduce, without deep Java expertise.
Multi-Query Approach – Apache Pig’s multi-query approach helps you reduce the length of your code, resulting in less development time.
Optimization Opportunities – In Apache Pig, tasks are optimized automatically, which lets programmers focus on the semantics of the language rather than on efficiency.
Extensibility – Using the existing operators, users can develop their own functions to read, process, and write data.
Rich Set of Operators – Pig provides many built-in operators for data operations such as joins, filters, and ordering. You can also use nested data types like tuples, bags, and maps that are not available in raw MapReduce.
User-Defined Functions – A major advantage of Apache Pig is that it allows you to create user-defined functions in other programming languages like Java, Ruby, Perl, and Python, and invoke them from Pig scripts.
Handles All Kinds of Data – You can analyze all kinds of data, both structured and unstructured, collected from various sources, and store the results in HDFS.
No Compilation – Since Apache Pig converts its operators into MapReduce jobs internally, no separate compilation step is needed.
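As a quick illustration of this brevity, here is a hedged Pig Latin sketch (the file name `students.txt` and its tab-delimited schema are assumptions for illustration) that does in a few lines what would take a full Java MapReduce job:

```pig
-- Load, filter, group, and count in a handful of statements
students = LOAD 'students.txt' USING PigStorage('\t')
           AS (name:chararray, gpa:float);
good     = FILTER students BY gpa >= 3.5;          -- keep high-GPA rows
grouped  = GROUP good ALL;                         -- single group over the relation
counted  = FOREACH grouped GENERATE COUNT(good);   -- count the surviving tuples
DUMP counted;
```

The equivalent MapReduce program would require a mapper class, a reducer class, and driver boilerplate before any of this logic could run.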
The architecture of Apache Pig consists of two components:
1) Pig Latin – Language of Apache Pig
Pig Latin is Apache Pig’s own language; it enables developers to write programs that process and analyze data.
2) A Runtime environment – Platform for running Pig Latin programs
The Pig Latin compiler serves as the runtime environment that converts Pig source code into executable code, generally in the form of MapReduce jobs.
First, programmers write Pig scripts. These scripts are processed by a series of Apache Pig components: a parser, an optimizer, a compiler, and finally the execution engine. The result is executable code in the form of MapReduce tasks, which run on the cluster and store their results in the Hadoop Distributed File System (HDFS).
The main prerequisites for Apache Pig are that Java and Hadoop are already installed on your system. Once they are set up, go through the following steps to download and install Apache Pig.
First of all, you have to download the latest version of the Apache Pig from the official website.
Step 1: Open the homepage of the Pig website and click the Release Page link under the News section.
Step 2: You will be navigated to the Apache Pig Releases page. Move to the Download section, where you can find two links: Pig 0.8 and later, and Pig 0.7 and before. To get the latest Pig releases, click Pig 0.8 and later. This redirects to a page listing a set of mirrors.
Step 3: On this page, choose and click one of the mirrors.
Step 4: The mirror that you have selected redirects you to the Pig Releases page. Here, you can view various versions of Apache Pig from which you have to click on the latest version.
Step 5: You can find folders on the page containing the source and binary files of Apache Pig for different distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Once the download completes, the files can be found in your Downloads folder.
After downloading the Apache Pig software, it must be installed in the Linux environment.
Step 1: Create a directory named Pig in the location where the installation directories of Java, Hadoop, and other software are usually kept.
$ mkdir Pig
Step 2: Extract the downloaded tar files.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3: Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier:
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
When these steps are completed, Apache Pig is installed on your system.
After the successful installation of Apache Pig, you need to configure it. Two files are involved: .bashrc and pig.properties.
Set the following variables in the .bashrc file:
Set PIG_HOME to the installation folder of Apache Pig,
append the Pig bin folder to the PATH environment variable, and
set the PIG_CLASSPATH environment variable to the configuration folder of your Hadoop installation.
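For example, assuming Pig was installed to /home/Hadoop/Pig and Hadoop’s configuration lives under $HADOOP_HOME/conf (adjust both paths to your own layout), the .bashrc entries might look like:

```shell
# Hypothetical paths -- adjust to match your installation
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
```

Run `source ~/.bashrc` (or open a new terminal) for the changes to take effect.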
pig.properties file
In the Pig configuration folder, you can find pig.properties, in which various parameters can be set. You can list the supported properties with the following command:
pig -h properties
Once you complete the configuration, verify the installation of Apache Pig on your system by typing the version command. If the installation is successful, Pig prints its version along with its build details.
$ pig --version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
Apache Pig Run Modes
Basically, Apache Pig has two execution (run) modes:
Local Mode: In local mode, Pig runs in a single JVM (Java Virtual Machine) and uses the local file system to store data. Local mode is suitable for analyzing small data sets with Apache Pig.
MapReduce Mode: In MapReduce mode, Pig Latin queries are converted into MapReduce tasks that run on a Hadoop cluster. MapReduce mode with a fully distributed Hadoop cluster is best for executing large data sets.
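The mode is selected with the -x flag when launching the Grunt shell:

```shell
$ pig -x local       # single JVM, data read from the local file system
$ pig -x mapreduce   # the default: runs jobs against the Hadoop cluster
```

Running plain `pig` with no flag is equivalent to `pig -x mapreduce`.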
Pig Latin has a nested data model that permits complex, non-atomic data types. Its main elements are:
Field (Atom): A small piece of data or an atomic value is referred to as a field. An atom holds a single value of any simple data type. It is stored as a string but can be used as both a string and a number. The atomic types in Pig are int, long, float, double, chararray, and bytearray. Ex: ‘12’ or ‘Apache’.
Tuples: A record formed by an ordered set of fields is called a tuple, and the fields can be of any data type. Tuples are similar to the rows found in RDBMS tables. Ex: (30, Apache)
Bags: A bag is an unordered collection of any number of tuples, represented by the symbol {}. The tuples in a bag need not contain the same number of fields, nor fields of the same types. Ex: {(5, Pig), (10, Apache)}.
Map: A map, also known as a data map, is a set of key-value pairs. The key must be unique and of type chararray, but the value can be of any type.
Relation: A relation is a bag of tuples. Relations in Pig Latin are unordered.
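These types can be combined in a LOAD schema. A hedged sketch (the file name and field names are illustrative assumptions):

```pig
-- scores is a bag of tuples; info is a map with chararray keys
data = LOAD 'records.txt'
       AS (name:chararray,
           scores:bag{t:(subject:chararray, mark:int)},
           info:map[]);
```

Loading the relation this way gives every downstream operator access to the nested fields, e.g. `FOREACH data GENERATE name, scores;`.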
The data types in Apache Pig are classified into two categories: primitive and complex.
| Type | Description |
|---|---|
| **Primitive Data Types** | |
| int | 32-bit signed integer |
| long | 64-bit signed integer |
| float | 32-bit floating point |
| double | 64-bit floating point |
| chararray | Character array (string) |
| bytearray | Byte array (blob) |
| **Complex Data Types** | |
| tuple | An ordered set of fields |
| bag | A collection of tuples |
| map | A set of key-value pairs |
The User-Defined Function (UDF) facility of Apache Pig lets you define your own functions. UDF support is provided in six programming languages: Java, Jython, JavaScript, Python, Ruby, and Groovy. With Java, you can write UDFs for various processes like data loading and storing, column transformation, and aggregation.
Apache Pig also has a Java repository of UDFs called Piggybank. Piggybank provides access to Java UDFs written by others and lets you contribute your own. There are three types of UDFs in Java:
Filter function
Eval function
Algebraic function
You can create your own UDF for Apache Pig by writing the function in Java, generating a jar file, and registering it in your script. An eval UDF must extend "org.apache.pig.EvalFunc" and override the ‘exec’ method. Here is an example EVAL function that converts a string to uppercase.
Compile the following class and package it into a jar file named myudfs.jar.
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
Finally, execute the script in the terminal to get the output.
-- script.pig
REGISTER myudfs.jar;
A = LOAD 'data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
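To produce myudfs.jar from the class above, a typical sequence looks like the following sketch (it assumes UPPER.java sits in the current directory and that the Pig jar, here named pig-0.15.0.jar, is available on the classpath):

```shell
$ javac -cp pig-0.15.0.jar -d . UPPER.java   # -d . writes myudfs/UPPER.class per the package
$ jar -cf myudfs.jar myudfs/                 # bundle the compiled class into the jar
$ pig script.pig                             # the script's REGISTER line picks up the jar
```

If the jar is not in the directory where you run Pig, give REGISTER the full path to it.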
Example Pig Script
For instance, create a Pig script that finds the number of products sold in each country.
The input to the sample Pig script is a CSV file, SalesJan2009.csv.
Step 1: Start Hadoop in your system
Step 2: Pig takes a file from HDFS in MapReduce mode and stores the results back to HDFS. Copy the file SalesJan2009.csv from the local file system to your HDFS home directory.
Step 3: Configuring the Pig
Start navigating to $PIG_HOME/conf
Open pig.properties using a text editor and specify the log file path with the pig.logfile property
Pig will use this file to log errors
Step 4: Type ‘pig’ at the command line to start the Grunt shell, Pig’s interactive query shell.
Step 5: In the Grunt prompt, run the below commands in order
a. Load the file that contains the data.
b. Group the data by the field Country.
c. For each tuple in ‘GroupByCountry’, generate strings of the form Country: number of products sold.
d. Store the results in a directory named ‘pig_output_sales’ on HDFS.
The command may take some time to execute.
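The steps above can be sketched as a single hedged script; the column names and positions in SalesJan2009.csv are assumptions here, so adjust the schema to match the actual file:

```pig
-- Step a: load the CSV; the listed columns are assumed, not verified
salesTable = LOAD '/user/hduser/SalesJan2009.csv' USING PigStorage(',')
             AS (Transaction_date:chararray, Product:chararray, Price:chararray,
                 Payment_Type:chararray, Name:chararray, City:chararray,
                 State:chararray, Country:chararray);
-- Step b: group by country
GroupByCountry = GROUP salesTable BY Country;
-- Step c: build "Country: count" strings for each group
CountByCountry = FOREACH GroupByCountry
                 GENERATE CONCAT((chararray)group,
                                 CONCAT(': ', (chararray)COUNT(salesTable)));
-- Step d: write the results to HDFS
STORE CountByCountry INTO '/user/hduser/pig_output_sales' USING PigStorage('\t');
```

Inside the FOREACH, `group` holds the grouping key (the country) and `salesTable` is the bag of rows for that country, which is why COUNT is applied to it.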
Step 6: The result can be viewed through the command-line interface, or through the HDFS web interface as follows:
Select ‘Browse the filesystem’ and navigate up to /user/hduser/pig_output_sales
Open part-r-00000
Now, you can see the result for the given input data in the Apache Pig script.
Thus, these are the essentials you need to know about Apache Pig, which analyzes data in Hadoop. It is one of the best ETL tools and manages data-flow workloads well. Learn Apache Pig and make use of it in the Hadoop ecosystem to manage huge data sets.
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.