Big Data has gone through several phases of evolution since it was first introduced. Hadoop appeared in 2005 with a small set of initial features, chief among them the MapReduce processing engine, which distributes large-scale data processing workloads across a cluster. Since then, Hadoop has changed considerably and has gained more advanced frameworks and methods.
YARN is a core component of Hadoop 2.0 that manages the resources of a cluster. It acts as a broker: it interacts with the compute resources on behalf of the applications and assigns resources to each application based on various scheduling criteria.
In this article, I will talk about the top advantages of YARN over Hadoop 1.0.
What is YARN Framework?
Before delving into the technical aspects, let us first understand what YARN is. YARN, or Yet Another Resource Negotiator, is the resource management layer of Hadoop 2.0. The YARN broker interacts with the compute resources on behalf of the applications and assigns resources to each application based on various scheduling criteria. Compared with Hadoop 1.0, a YARN-based Hadoop cluster scales better and uses its resources more fully, which benefits the Hadoop ecosystem and the entire range of technologies associated with it. Now that we are more familiar with YARN, let's move ahead.
Limitations of Hadoop 1.0 framework
To understand the advantages of the YARN framework, it is important to understand how Hadoop 1.0 works and what its limitations are.
Hadoop 1.0 tightly couples the MapReduce model with cluster management. All resource management is done by the JobTracker, which is part of the MapReduce framework. Each job is divided into map tasks and reduce tasks. Each task runs on one of the DataNodes (the worker machines of the cluster), and each DataNode is configured with a fixed, limited number of slots for running tasks concurrently.
This is where the JobTracker comes in. It both manages the cluster resources and drives MapReduce job execution. In a nutshell, the JobTracker schedules and reserves the task slots, and configures and monitors each running task. If a task fails, it allocates a new slot so that the task can start again. Once a task finishes, the JobTracker releases the slot for other tasks and cleans up the temporary resources.
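To make the slot mechanism concrete, here is a minimal Python sketch of it. It is a toy model, not Hadoop code: `SlotNode`, `ToyJobTracker`, and their methods are hypothetical names I made up for illustration, and the slot counts are arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class SlotNode:
    """A worker node with a fixed number of map and reduce slots (toy model)."""
    name: str
    free_slots: dict  # e.g. {"map": 1, "reduce": 1}


@dataclass
class ToyJobTracker:
    """Hypothetical stand-in for the Hadoop 1.0 JobTracker; not the real API."""
    nodes: list
    running: dict = field(default_factory=dict)  # task id -> (node, kind)

    def schedule(self, task_id, kind):
        """Reserve a free slot of the matching kind, or return None."""
        for node in self.nodes:
            if node.free_slots[kind] > 0:
                node.free_slots[kind] -= 1
                self.running[task_id] = (node, kind)
                return node.name
        return None  # no slot of this kind is free, even if other kinds are idle

    def finish(self, task_id):
        """Release the slot once the task completes."""
        node, kind = self.running.pop(task_id)
        node.free_slots[kind] += 1

    def retry(self, task_id):
        """On task failure, free the old slot and reserve a new one."""
        _, kind = self.running[task_id]
        self.finish(task_id)
        return self.schedule(task_id, kind)


tracker = ToyJobTracker(nodes=[
    SlotNode("node-1", {"map": 1, "reduce": 1}),
    SlotNode("node-2", {"map": 1, "reduce": 1}),
])
placements = [tracker.schedule(f"map-{i}", "map") for i in range(3)]
print(placements)  # ['node-1', 'node-2', None]
```

The third map task gets no slot even though both reduce slots are idle, and every decision goes through the single tracker object. Both points come up again in the drawbacks discussed in this section.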
Major drawbacks of the above approach:
- Availability – The JobTracker is a single point of failure in Hadoop 1.0. If the JobTracker fails, all the jobs it was managing have to be restarted.
- Limited scalability – The JobTracker handles both resource management and job scheduling, and it runs on a single machine. That one machine becomes the bottleneck as the cluster grows, which limits scalability.
- Resource utilization – In the above approach, the map slots and reduce slots are predefined and cannot be used interchangeably. It can happen that all the map slots are full while the reduce slots are empty, or the other way round. Because the empty slots are reserved for the other task type, they sit idle instead of relieving the full ones, which wastes cluster resources.
- Running non-MapReduce applications – The JobTracker is built for the MapReduce framework and works well with it. The problem arises when a non-MapReduce application tries to run on the cluster: the application has to conform to the MapReduce programming model in order to run successfully. Some of the common problem areas are:
  - Ad-hoc queries
  - Real-time analysis
  - Message-passing approaches
- Cascading failure – One of the major issues in this framework appears when the cluster grows beyond about 4,000 nodes. At that scale, failures can cascade and degrade the entire cluster.
These are some of the major limitations of this framework; there are minor ones as well, which I have not covered here. The YARN framework was introduced to overcome these limitations.
YARN framework and its advantages
The YARN framework, introduced in Hadoop 2.0, takes the cluster management responsibilities away from MapReduce. This leaves MapReduce to handle data processing only, which streamlines the process.
YARN brings in the concept of central resource management. This allows multiple applications to run on Hadoop while sharing a common pool of resources.
Some of the major components of the YARN framework are:
- Resource Manager – The Resource Manager is the negotiator for all the resources present in the cluster. It is made up of a scheduler, which allocates resources, and an applications manager, which accepts and manages user jobs. From Hadoop 2.0 onwards, any MapReduce job is treated as an application.
- Application Master – Each job or application gets its own Application Master, which is where the application's control logic lives. It negotiates resources from the Resource Manager, manages the execution of the application's tasks (for example, those of a MapReduce job), and shuts down once the job has finished processing.
- Node Manager – The Node Manager runs on every worker node. It launches the containers in which an application's tasks run, keeps track of their resource usage on that node, and reports it to the Resource Manager. Information about completed MapReduce jobs is kept by a separate Job History Server, not by the Node Manager.
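The way these three components cooperate can be sketched in a few lines of Python. This is a simplified toy model under my own assumptions: the class names `ToyResourceManager`, `ToyNodeManager`, and `ToyApplicationMaster` are hypothetical, memory is the only resource, and allocation is first-fit. The real YARN schedulers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    """Per-node agent: tracks free memory and launches containers (toy model)."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)  # (owner, mb) pairs

    def launch(self, owner, mb):
        self.free_mb -= mb
        self.containers.append((owner, mb))


@dataclass
class ToyResourceManager:
    """Cluster-wide arbiter: grants a container wherever capacity remains."""
    nodes: list

    def allocate(self, owner, mb):
        for node in self.nodes:
            if node.free_mb >= mb:
                node.launch(owner, mb)
                return node.name
        return None  # the request would stay pending until capacity frees up

    def release_all(self, owner):
        for node in self.nodes:
            node.free_mb += sum(mb for o, mb in node.containers if o == owner)
            node.containers = [c for c in node.containers if c[0] != owner]


@dataclass
class ToyApplicationMaster:
    """One per application: asks the ResourceManager for task containers."""
    app_id: str
    rm: ToyResourceManager

    def run(self, task_sizes_mb):
        return [self.rm.allocate(self.app_id, mb) for mb in task_sizes_mb]

    def finish(self):
        self.rm.release_all(self.app_id)


rm = ToyResourceManager([ToyNodeManager("node-1", 4096),
                         ToyNodeManager("node-2", 4096)])
# The ResourceManager first grants a container for the ApplicationMaster itself.
am_host = rm.allocate("app-1", 1024)
am = ToyApplicationMaster("app-1", rm)
grants = am.run([2048, 2048, 1024])
print(am_host, grants)  # node-1 ['node-1', 'node-2', 'node-1']
am.finish()
print([n.free_mb for n in rm.nodes])  # [4096, 4096]
```

Note that containers of different sizes share the same nodes, and that everything the application held is returned to the pool once its Application Master finishes.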
Now that we know the YARN framework has different components to manage the different tasks, let's see how it addresses the limitations of Hadoop 1.0.
- Better utilization of resources – The YARN framework does not have any fixed slots for tasks. It provides a central resource manager that hands out resources on demand, so multiple applications can share a common pool of resources.
- Running non-MapReduce applications – In YARN, the scheduling and resource management capabilities are separated from the data processing component. This allows Hadoop to run varied types of applications that do not conform to the MapReduce programming model. Hadoop clusters can now run independent interactive queries and perform better real-time analysis.
- Backward compatibility – YARN is backward compatible, which means existing MapReduce jobs can be executed in Hadoop 2.0.
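- JobTracker and TaskTracker no longer exist – The two major roles of the JobTracker were resource management and job scheduling. With the introduction of the YARN framework, these two roles are split between two separate components, namely:
  - Resource Manager (cluster-wide resource management)
  - Application Master (scheduling and monitoring for a single application)
On the worker nodes, the Node Manager takes over from the TaskTracker. This eliminates the need for the JobTracker and the TaskTracker, and thus, they do not exist in Hadoop 2.0.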
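The utilization point from the list above is easy to see with a little arithmetic. The following Python sketch compares fixed map/reduce slots with a shared pool. It assumes, purely for illustration, that every task needs exactly one unit of capacity; the function names and the numbers are hypothetical.

```python
def run_with_fixed_slots(pending, slots):
    """Hadoop 1.0 style: each task kind can only use slots of its own kind."""
    return {kind: min(pending.get(kind, 0), slots[kind]) for kind in slots}


def run_with_shared_pool(pending, capacity):
    """YARN style (simplified): any task may use any free unit of capacity."""
    started = {}
    for kind, count in pending.items():
        started[kind] = min(count, capacity)
        capacity -= started[kind]
    return started


pending = {"map": 10, "reduce": 0}  # a map-heavy phase of a job
fixed = run_with_fixed_slots(pending, {"map": 6, "reduce": 4})
pooled = run_with_shared_pool(pending, capacity=10)

print(sum(fixed.values()), sum(pooled.values()))  # 6 10
```

With the same total capacity of ten units, the fixed-slot cluster starts six tasks and leaves four reduce slots idle, while the shared pool starts all ten.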
The primary focus of this article was to explain how the YARN framework has made it easier for developers to build applications for Hadoop. Applications are no longer required to be implemented with third-party tools. YARN is a major change that allows users to build applications on Hadoop 2.0 and manipulate data more effectively. With time, there will be further developments to enhance the usability of Hadoop. For now, the YARN framework plays a crucial role in dealing with the existing problems and in creating an environment that is more hassle-free and more generic than the earlier version of the MapReduce model.