Steps to work with Windows Azure HDInsight



Hadoop has made big data handling simpler and, given the importance now attached to big data, it is widely viewed as a key tool for big data management. However, organizations may want easier access to Hadoop and simpler ways of handling it, and this is where Windows Azure HDInsight comes into the picture. Windows Azure HDInsight makes Hadoop available as a service in the cloud and provisions Hadoop clusters within its framework. Think of it like any other cloud-based service offered on a subscription model. For both users and service providers, the main benefits are efficient provisioning and user management.


Windows Azure HDInsight is a service that deploys and provisions Apache Hadoop clusters in the cloud and provides a framework for managing, analysing and reporting on big data. After deployment and provisioning, Apache Hadoop can be offered on a software-as-a-service model. Windows Azure HDInsight makes the HDFS/MapReduce software framework and related projects available in a simpler, more cost-efficient and scalable environment. To simplify Hadoop jobs and the management of deployed Hadoop clusters, it provides JavaScript and Hive interactive consoles. These consoles give software developers and those managing big data easier access to the entire Hadoop framework, which together makes up a comprehensive ecosystem for Apache Hadoop. Windows Azure HDInsight also provides Open Database Connectivity (ODBC) drivers for integration with Business Intelligence (BI) tools such as SQL Server Analysis Services, Excel and other big data analytics and reporting services.

What is big data and when does it make sense?

Big data refers to large volumes of structured and unstructured data that originate from many different sources. Structured data is organized in a defined manner and is easy to interpret, for example tabular data. Unstructured data is unorganized and more difficult to interpret, for example a telephone conversation between a customer care specialist and a customer. Big data can take several forms, such as numbers, text, images and documents. With the rise of social media, smartphones and other mobile devices, and interconnected devices or the Internet of Things (IoT), the volume, velocity and variety of data have all burgeoned.
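
As a small illustration of the distinction, the sketch below uses invented sample data (the customer table and the call transcript are both hypothetical). A structured record can be queried directly by field name, while the same fact in unstructured text has to be extracted with text processing first.

```python
import csv
import io
import re

# Structured: tabular data with named columns (sample values are invented).
table = io.StringIO("customer_id,plan,monthly_fee\n101,basic,10\n102,premium,25\n")
rows = list(csv.DictReader(table))
premium_fees = [int(r["monthly_fee"]) for r in rows if r["plan"] == "premium"]

# Unstructured: a free-text call transcript (also invented). The same fact
# has to be pulled out with pattern matching before it can be used.
transcript = "Customer 102 said the premium plan at 25 dollars a month is too expensive."
mentioned_amounts = [int(n) for n in re.findall(r"(\d+) dollars", transcript)]

print(premium_fees, mentioned_amounts)
```

Real unstructured sources such as audio or images need far heavier processing than a regular expression, which is part of why big data is hard to handle.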

Big data is important because it can potentially provide deep insights into areas such as customer behaviour, student performance, employee behaviour and performance, and the economy. These insights are highly valuable to enterprises and institutions and can help them achieve their goals. For example, organizations such as Amazon use insights into customer behaviour to recommend relevant products and offerings.

However, big data offers no value in itself until it is processed and insights are extracted, and that is a challenging task given the complex nature of the data. Relentless data collection and complex formats make extracting insights difficult, and enterprises had been looking for a solution until Hadoop arrived.

What is Apache Hadoop?

Apache Hadoop is an open source software framework that makes it possible to store huge volumes of data and run multiple applications on clusters of commodity hardware. Overall, Hadoop offers massive storage capacity, substantial processing power and the ability to run a very large number of concurrent tasks or jobs. Given the rise of big data, the arrival of Apache Hadoop was timely, because existing non-Hadoop systems struggled to manage data at that scale. The main features of Hadoop are:

  • Ability to quickly store all types of data in huge volumes. Relational database systems typically struggle to store so much data so quickly.
  • Its distributed computing model allows for high-speed big data processing.
  • Protection against hardware failure. If one node fails, the task is redirected to another node. Multiple copies of the same data are stored automatically.
  • Unlike in a relational database, data does not need to be pre-processed before it is stored.
  • Low cost, as commodity hardware is used.
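
The protection against hardware failure described above can be sketched in a few lines of Python. This is only a toy model, not the HDFS implementation: the node names, the random placement policy and both helper functions are hypothetical, and the replication factor of 3 is assumed because it is the usual HDFS default.

```python
import random

REPLICATION_FACTOR = 3  # assumed value; HDFS commonly defaults to 3 copies per block


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement policy)."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in block_ids}


def read_block(block, placement, failed_nodes):
    """Return the first healthy node holding `block`, or raise if none is left."""
    for node in placement[block]:
        if node not in failed_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} are unavailable")


nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
placement = place_replicas(["block-a", "block-b"], nodes)

# Simulate the failure of the first replica holder for block-a.
failed = {placement["block-a"][0]}
survivor = read_block("block-a", placement, failed)
print(f"block-a served from {survivor} after {failed} failed")
```

Because every block lives on several nodes, losing one node only redirects the read to another copy; data becomes unavailable only if all replicas fail at once.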

What is Hadoop on HDInsight?

Hadoop on HDInsight is a service that offers Apache Hadoop on the SaaS model. It includes all the important components of the Apache Hadoop technology stack, such as Apache Spark, HBase, Kafka, Storm, Pig, Hive and Interactive Hive. While Apache Hadoop allows enterprises to take full advantage of the potential of big data, it is important that they can access and use it without investing too much in training and acclimatisation. The HDInsight framework makes deploying Apache Hadoop clusters easy, and because it runs in the cloud, it can scale with the volume of data that needs to be processed. HDInsight makes the Hadoop clusters and components from the Hortonworks Data Platform (HDP) distribution available in the cloud. It facilitates the deployment of clusters with high availability and reliability, and provides enterprise-grade security and governance with the help of Active Directory.

Hadoop eco-system on HDInsight

The full Hadoop ecosystem is deployed on HDInsight and is available on the SaaS model. For big data handling, HDInsight offers two capabilities: processing with the Apache Hadoop ecosystem, and integration with BI tools for end-to-end big data handling. A distinctive feature of HDInsight is its Hive Interactive and JavaScript consoles, which make accessing and handling the Apache Hadoop stack simpler. The Hadoop ecosystem on HDInsight includes the following:

  • Ambari for cluster provisioning, monitoring and management.
  • Hive and HCatalog provide querying abilities and a table and storage management layer.
  • Mahout for scalable machine learning applications.
  • MapReduce provides the framework for distributed processing of large data sets.
  • Oozie provides workflow management capabilities.
  • Pig provides simpler scripting features for MapReduce transformations.
  • Sqoop provides data import and export capabilities.
  • Tez allows highly data-intensive processes to run efficiently.
  • YARN is part of the core Hadoop library; it handles cluster resource management and represents the next generation of the Hadoop MapReduce framework.
  • ZooKeeper coordinates processes across the distributed systems in Hadoop.
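
To give a feel for the MapReduce model listed above, here is a minimal single-process sketch of the classic word count in plain Python. It does not use any Hadoop API; the function names are illustrative only, and on a real cluster the map and reduce steps would run in parallel on many nodes, with the framework performing the shuffle between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data needs Hadoop", "Hadoop handles big data"])
print(counts)
```

Tools such as Pig and Hive generate jobs of this general shape from higher-level scripts and queries, so developers rarely need to write the map and reduce functions by hand.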

HDInsight also offers integration with BI tools such as Excel, Power BI and SQL Server Analysis Services. In short, the entire Hadoop ecosystem is deployed on HDInsight.

Getting started – Let’s try

Here we will first create an account in the Microsoft Azure portal by using the following link. Then we will log in and check the components available on the dashboard.

It will open a screen as shown below.


Now click the ‘Start free’ button as shown above. It will take you to the following screen for creating a new account.


Click on the ‘Get a new account’ link as shown above. It will navigate to the following screen.

Create Account

Enter your email ID and password.

Create Account

Click ‘Next’ to go to the following screen.

Enter Code

Now check your e-mail. The message body will contain a security code similar to the one shown below.

Verify Email

Enter the security code and press the ‘Next’ button.

Enter code

After you click the ‘Next’ button, you will be taken to a security screen where you need to enter your phone number, as shown below.

Security code

Click the ‘Send code’ button. An access code will be sent to your mobile number, and you will be taken to the following screen.

Security Info

Enter the access code and press the ‘Next’ button, as shown below.

Security Info

After you click the ‘Next’ button, you will be taken to the free trial sign-up page, as shown below.

Trial Sign up

Now fill in the details and click the ‘Next’ button. It will take you to the following screen.

Trial Sign up

Enter your mobile number to receive the access code, then enter that code in the next field. On the next screen, you need to provide your credit card details, but only to verify your identity. You will not be charged unless you explicitly upgrade to a paid version.

Payment Info

Click ‘Next’ and accept the agreement. Your free trial account is now created.
After this, open the Azure portal and log in using your user ID and password. It will open a dashboard as shown below. You can use the following link.
Sign in to

Sign in

Click ‘+ New’ in the top-left menu to display a list of options, as shown below.

Click New

Now you can click the HDInsight menu option in the left pane and use it as needed.

Dash board

HDInsight is mainly used for intelligence and analytics applications. You can check the screenshot below.

Dash board

You can also explore the other options to familiarize yourself with the portal. If you click HDInsight, it will display the details as shown below.

Dash board

Here you can create a cluster and move on to developing different types of applications.

Dash board

In the left pane, if you click ‘Database’, you can see the SQL and NoSQL databases, as shown in the screenshot below.

Dash board

Similarly, if you click ‘Storage’ in the left pane, you will get multiple options, as shown below.



Now, your Windows Azure HDInsight environment is ready for use.

What is HDInsight Emulator for Azure?

Apart from the general-purpose Azure HDInsight portal login shown above, there is another option available for developers. It is known as the HDInsight Emulator for Azure (earlier known as the Microsoft HDInsight Developer Preview). With its support for single-node deployment, it is a very effective way to get a feel for HDInsight. It is essentially a Hadoop sandbox from Hortonworks, and it provides a local development environment built on Hadoop ecosystem products, including HDInsight.


Overall, the main value proposition of Windows Azure HDInsight is the easy availability of the full Apache Hadoop technology stack in the cloud, simpler handling of Hadoop, end-to-end big data management, scalability and fast processing. The regular features of cloud-based subscriptions are available as well. There remain concerns about the security of data in the cloud, which is a long-standing drawback of cloud-based services. On the other hand, the availability of consoles should reduce the need for enterprises to invest heavily in Hadoop training; in fact, developers can start with the consoles and manage big data right from the start. Like Apache Hadoop itself, deploying the entire Hadoop ecosystem in the cloud and making it available to enterprises is quite an innovative idea. While it is gradually gaining traction, the jury is still out on how well it is being accepted.


