Hadoop has made big data handling simpler, and given the importance now attached to big data, it is widely viewed as a key tool in big data management. However, organizations may need easier access to Hadoop and simpler ways to operate it, and this is where Windows Azure HDInsight comes into the picture. Windows Azure HDInsight makes Hadoop available as a service in the cloud and provisions Hadoop clusters within its framework. Think of Windows Azure HDInsight like any other cloud-based service offered on a subscription model. From the perspective of both users and service providers, the main benefits are efficient provisioning and user management.
What is big data and when does it make sense?
Big data refers to large volumes of structured and unstructured data that originate from several different sources. Structured data are organized in a defined manner and are easy to decipher, for example, tabular data. Unstructured data are unorganized and more difficult to decipher, for example, a telephone conversation between a customer care specialist and a customer. Big data can be of several types such as numbers, text, images and documents. With the rise of social media, smartphones and other mobile devices, and interconnected devices or the Internet of Things (IoT), the volume, velocity and variety of data have all burgeoned.
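To make the distinction between structured and unstructured data concrete, here is a minimal Python sketch. The order record and the call transcript are invented sample data, and the regular expression is only one hypothetical way to pull a value out of free text.

```python
import re

# Structured data: fields are named, so a value can be read directly.
order = {"customer_id": 1042, "product": "router", "amount": 59.99}
structured_amount = order["amount"]

# Unstructured data: the same fact is buried in free text (invented transcript).
transcript = "Hi, I'm calling about the router I bought for 59.99 dollars last week."

# One hypothetical extraction rule; real conversations would need far more work.
match = re.search(r"(\d+\.\d+) dollars", transcript)
unstructured_amount = float(match.group(1)) if match else None

print(structured_amount, unstructured_amount)
```

The structured lookup is a single key access, while the unstructured case depends on a pattern that breaks as soon as the customer phrases the sentence differently.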
Big data is important because it can provide deep insights into areas such as customer behaviour, student performance, employee behaviour and performance, and the economy. These insights are highly valuable to enterprises and institutions and can help them achieve their goals. For example, well-known organizations such as Amazon use insights on customer behaviour to provide relevant recommendations on their products and offerings.
However, big data in itself does not offer any value until it is processed and the insights are extracted, and that is a challenging task given the complex nature of the data. Relentless data collection and complex formats make extracting insights difficult, and enterprises had been looking for a solution when Hadoop arrived.
What is Apache Hadoop?
Apache Hadoop is an open source software framework that makes it possible to store huge volumes of data and run multiple applications on clusters of commodity hardware. Overall, Hadoop offers massive storage capacity, substantial processing power and the ability to handle a very large number of concurrent tasks or jobs. In the context of the rise of big data, the arrival of Apache Hadoop was timely because existing, non-Hadoop systems would struggle to manage big data. The main features of Hadoop are:
- Ability to quickly store all types of data in huge volumes. Relational database systems typically struggle to store so much data so quickly.
- Its distributed computing model allows for high-speed big data processing.
- Protection against hardware failure. If one node fails, the task is redirected to another node. Multiple copies of the same data are stored across nodes.
- Unlike with a relational database, data do not need to be pre-processed before being stored.
- Low cost as commodity hardware is used.
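The fault-tolerance point in the list above can be illustrated with a toy simulation. This is a sketch only, not Hadoop's actual API: the node names, the block names, the replication factor of 3 and the helper functions `place_blocks` and `run_task` are all assumptions made for illustration.

```python
import random

REPLICATION_FACTOR = 3  # assumed number of copies kept per block


def place_blocks(blocks, nodes, replication=REPLICATION_FACTOR, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement policy)."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in blocks}


def run_task(block, placement, failed_nodes):
    """Return the first healthy node holding the block, or None if every replica is lost."""
    for node in placement[block]:
        if node not in failed_nodes:
            return node
    return None


# Hypothetical five-node cluster storing two blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["block-a", "block-b"], nodes)

# Simulate the failure of the first node that holds block-a.
failed = {placement["block-a"][0]}
chosen = run_task("block-a", placement, failed)

print(f"replicas: {placement['block-a']}, failed: {failed}, task ran on: {chosen}")
```

Because each block lives on several nodes, losing one node only changes which replica the task reads from; the data itself remains available.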
What is Hadoop on HDInsight?
Hadoop on HDInsight is a service that offers Apache Hadoop as a managed cloud service. It includes important components of the Apache Hadoop technology stack such as Apache Spark, HBase, Kafka, Storm, Pig, Hive and Interactive Hive. While Apache Hadoop allows enterprises to take full advantage of the potential offered by big data, it is important that enterprises are able to access and use it without too much investment in training and acclimatisation. The HDInsight framework makes deployment of Apache Hadoop clusters easy. Because it runs in the cloud, it can be scaled according to the volume of data that needs to be processed. The HDInsight framework makes the Hadoop clusters and components from the Hortonworks Data Platform (HDP) distribution available in the cloud. It facilitates the deployment of clusters with high availability and reliability, and provides enterprise-grade security and governance with the help of Active Directory.
Hadoop ecosystem on HDInsight
- Ambari for cluster provisioning, monitoring and management.
- Hive and HCatalog provide querying abilities and a table and storage management layer.
- Mahout for scalable machine learning applications.
- MapReduce provides the framework for distributed computing and resource management.
- Oozie provides workflow management capabilities.
- Pig provides simpler scripting features for MapReduce transformations.
- Sqoop provides data import and export capabilities.
- Tez allows all processes that are highly data-intensive to run efficiently.
- YARN is part of the core Hadoop library and represents the next generation of the Hadoop MapReduce framework.
- ZooKeeper provides the capability to coordinate processes across the distributed systems in Hadoop.
HDInsight also offers integration with BI tools such as Excel, Power BI and SQL Server Analysis Services. Basically, the entire Hadoop ecosystem is deployed on HDInsight.
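To give a feel for the MapReduce programming model mentioned in the list above, here is a minimal word-count sketch in plain Python. It runs locally in a single process and only imitates the map, shuffle and reduce phases; the function names and the sample input are assumptions for illustration and are not part of any Hadoop or HDInsight API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


# Each string stands in for an input split handled by a separate mapper.
splits = ["big data needs big storage", "Hadoop stores big data"]

mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

On a real cluster the map and reduce functions would run in parallel on different nodes, with the framework handling the grouping step and the movement of data between them.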
Getting started – Let’s try
First, we will create an account in the Microsoft Azure portal by using the following link. Then we will log in and check the components available on the dashboard.
It will open a screen as shown below.
Now, click on the ‘Start free’ button as shown above. This takes you to the following screen for creating a new account.
Now, you can click on the HDInsight menu option in the left pane and use it as per your need.
It is mainly used for intelligence and analytics applications. You can check the screenshot below.
Now, your Windows Azure HDInsight environment is ready for use.
What is HDInsight Emulator for Azure?
Apart from the general-purpose Azure HDInsight portal login (as shown above), there is another option available for developers. It is known as the HDInsight Emulator for Azure (earlier known as the Microsoft HDInsight Developer Preview). It is a very effective way to get an idea of HDInsight, with support for single-node deployment. It is basically a Hadoop sandbox from Hortonworks. It provides a local development environment by using Hadoop ecosystem products including HDInsight.
Overall, the main value proposition of Windows Azure HDInsight has been easy availability of the full Apache Hadoop technology stack in the cloud, simpler handling of Hadoop, end-to-end big data management, scalability and fast processing. The regular features of cloud-based subscriptions are also available. There remain concerns about the security of data in the cloud, which is a long-standing drawback of cloud-based services. On the other hand, the availability of consoles should reduce the need for enterprises to invest heavily in Hadoop training. In fact, developers can start with the consoles and manage big data right from the start. Like Apache Hadoop itself, deploying the entire Hadoop ecosystem in the cloud and making it available to enterprises is an innovative idea. While the idea is gradually gaining traction, the jury is still out on how well it is being accepted.