Overview:
SQL on Hadoop is a group of analytical application tools that combine the SQL-style querying and processing of data with the most recent Hadoop data framework elements. The emergence of SQL on Hadoop is an important development for big data processing because it allows wider groups of people to successfully work with the Hadoop data processing framework by running SQL queries on the enormous volumes of big data that Hadoop processes. Obviously, the Hadoop framework was previously not as accessible to people especially in terms of its querying capabilities. Based on the development, several tools have been coming up and that promises to improve the productivity of enterprises when it comes to processing and analyzing big data with quality and speed. Also, there is no need to invest a lot in learning the tool as traditional knowledge of SQL should do.
Definition of SQL on Hadoop
As written earlier, SQL on Hadoop is a group of applications that allows you to run SQL-style queries on big data hosted by the Hadoop data processing framework. Obviously, data querying, retrieving and analysis have become easier with the addition of SQL on Hadoop. Since SQL was originally designed for relational database, it had to be modified according to the Hadoop 1 model that comprises the Map-Reduce and the Hadoop Distributed File System (HDFS) and the Hadoop 2 model that does not have the Map-Reduce and the Hadoop Distributed File System (HDFS).
One of the earliest efforts to combine SQL with Hadoop had resulted in the creation of the Hive Data warehouse that had the HiveQL software which could translate SQL-style queries into MapReduce jobs. After that, several applications were developed which could do similar jobs. Prominent among the later tools are Drill, BigSQL, Hawq, Impala, Hadapt, Stinger, H-SQL, Splice Machine, Presto, Polybase, Spark, JethroData, Shark (Hive on Spark), and Tez (Hive on Tez).
How does SQL on Hadoop work?
SQL on Hadoop works with Hadoop in the following ways mainly:
- Connectors in the Hadoop environment translate SQL query into a MapReduce format so that Hadoop understands the query.
- Push-down systems execute SQL query within the Hadoop clusters.
- Systems divide the huge volume of SQL queries between MapReduce-HDFS clusters depending on the workloads of the clusters.
It seems that SQL query does not change its nature; it is the Hadoop that adapts the query in a format it understands.
Top benefits of SQL on Hadoop
As already stated, SQL on Hadoop is an important development in the context of making big data analysis accessible to more people and making data analysis easier and faster. There is no doubt that the Hadoop data framework has been a great tool for big data analysis but it still allowed itself to be accessed by a limited group of people not only because huge efforts were needed to learn its unique architecture but also it had compatibility issues with other technologies. SQL on Hadoop promises to address these issues.
More people can access Hadoop now
It seems that SQL on Hadoop has made Hadoop more egalitarian in the sense that wider groups of people can now use Hadoop to process and analyze data. Earlier, in order to use Hadoop, you needed to have knowledge of the Hadoop architecture — MapReduce, Hadoop Distributed File System, or HBase. Now, you can plug in almost any analytical or reporting tool and access and analyze the data. Thanks to SQL on Hadoop, a number of SQL on Hadoop engines such as Cloudera Impala, Concurrent Lingual, Hadapt, CitusDB, InfiniDB, MammothDB, MemSQL, Pivotal HawQ, Apache Drill, ScleraDB, Progress DataDirect, Simba and Splice Machine are now commercially available for use with big data. Obviously, this has opened Hadoop to a wider audience which can now expect to increase their returns on investment in big data.
Analyzing big data with Hadoop is simpler now
Now, you just need to run the good old SQL query on the big data to retrieve and analyze data. SQL has evolved itself from just being a relational database tool to a big data analysis tool which is indeed a significant change. You do not need to worry how Hadoop is processing the queries — it has its own way of interpreting the SQL queries and giving you the results. Experts believe that although the Hadoop Distributed File System (HDFS) does have parallel processing commodity clusters for big data, it can improve its processing capabilities if it works with SQL-style interactive querying. Before the HDFS combined with SQL, it would take a long time to process data with the HDFS and the task required specialized data scientists. And the queries were not interactive. With the Apache Tez framework which comprises the Spark analytical engine and the Stinger interactive query accelerator for the Hive data warehouse, these problems have been addressed. According to Anu Jain, the group manager of strategy and architecture at retailer Target Corporation, “It is very important for us to ensure we are giving users interactive query access. With Tez we are able to provide that capability to the business.”
The popularity of interactive analytics has been growing among Hadoop users, as a Gartner survey revealed. According to the survey, 32% of the respondents use third-party interfaces to the HDFS or Hbase, 27% use self-created queries through Hive while 23% use Hadoop distribution specific tools such as Cloudera Impala and Pivotal HAWQ.
Another perspective on SQL on Hadoop
While it seems that SQL on Hadoop are going to solve a lot of issues we have with Hadoop, there is another view that finds that SQL may have a lot of problems especially when combined with Hadoop. According to this view, SQL may not after all be that efficient as an analytical tool when it comes to big data. According to Hadoop Summit user panelist John Williams, SQL may not be the best analytical tool to work with big data. According to Williams, who is the senior vice president for platform operations, TrueCar, which offers users a car-buying platform online, “SQL execution time on a large data set is slow. Meanwhile, Hadoop-on-SQL is getting faster with things like Yarn and Tez.” And that is not the only issue with SQL. There are a lot of overhead tasks such as data studying, schema conceiving, index and query creation and normalization that you need to take care of when you are combining SQL with Hadoop and you may be spending a lot of time and effort. After all that effort, there is no guarantee that you have done a permanent work. If anything with the application changes, you may be required to redo what you already did. Instead of SQL, big data-focused development should be done based on Java and Python since these languages are better suited for unstructured data processing.
Summary
Jury is out on whether SQL on Hadoop is the answer to the problems people faced with using Hadoop. But clearly, the industry needs a better alternative to Hadoop’s own data querying capabilities and that alternative has to be interactive. SQL on Hadoop tools provide interactive analytics which is useful. Enterprises do not want to waste time on trying to make sense out of complicated, time-taking analytics. For the time being, enterprises find SQL on Hadoop tools useful, as the Gartner survey found.