Kudu is a new open source project that provides updatable storage. It complements HDFS and HBase: HDFS is best at sequential, append-only storage, while HBase is built for random access. Kudu is better suited to fast analytics on fast-changing data, which is what businesses increasingly demand. So Kudu is not just another project in the Hadoop ecosystem; it has the potential to change the market.
What is Kudu?
Kudu is a storage system that stores structured data in tables. Each table has a predefined set of columns and a primary key made up of one or more of those columns. The primary key enforces a uniqueness constraint on the rows and also serves as an index, which makes updating and deleting rows efficient. Each table is split into subsets of data called tablets.
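The table model described above can be sketched in a few lines of Python. This is only a toy in-memory illustration, not Kudu's client API: the class name `ToyTable`, its methods, and the choice of a CRC32 hash to assign rows to tablets are all assumptions made for the example.

```python
import zlib
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
Key = Tuple[object, ...]


class ToyTable:
    """Hypothetical toy table: fixed columns, a primary key, rows split into tablets."""

    def __init__(self, columns: List[str], key_columns: List[str], num_tablets: int = 4):
        if not key_columns or not set(key_columns) <= set(columns):
            raise ValueError("primary key must be one or more of the table's columns")
        self.columns = list(columns)
        self.key_columns = list(key_columns)
        # Each tablet is a dict that acts as an index from primary key to row.
        self.tablets: List[Dict[Key, Row]] = [{} for _ in range(num_tablets)]

    def _key_of(self, row: Row) -> Key:
        return tuple(row[c] for c in self.key_columns)

    def _tablet_for(self, key: Key) -> Dict[Key, Row]:
        # A deterministic hash of the key picks the tablet (an illustrative choice).
        index = zlib.crc32(repr(key).encode("utf-8")) % len(self.tablets)
        return self.tablets[index]

    def insert(self, row: Row) -> None:
        if set(row) != set(self.columns):
            raise ValueError("row must supply exactly the predefined columns")
        key = self._key_of(row)
        tablet = self._tablet_for(key)
        if key in tablet:
            # The primary key acts as a uniqueness constraint.
            raise KeyError(f"duplicate primary key: {key}")
        tablet[key] = dict(row)

    def get(self, key: Key) -> Optional[Row]:
        return self._tablet_for(key).get(key)

    def update(self, key: Key, changes: Row) -> None:
        if set(changes) & set(self.key_columns):
            raise ValueError("primary key columns cannot be changed")
        if not set(changes) <= set(self.columns):
            raise ValueError("unknown column in update")
        # The key lookup finds the row directly; raises KeyError if it is absent.
        self._tablet_for(key)[key].update(changes)

    def delete(self, key: Key) -> bool:
        return self._tablet_for(key).pop(key, None) is not None


# Usage: a two-column primary key identifies each row.
metrics = ToyTable(["host", "metric", "value"], key_columns=["host", "metric"])
metrics.insert({"host": "web1", "metric": "cpu", "value": 0.42})
metrics.update(("web1", "cpu"), {"value": 0.57})
```

Because every operation starts from the primary key, an update or delete goes straight to one tablet and one row instead of scanning the table.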
Kudu protects data against hardware and software failures by replicating each tablet on multiple nodes, using the Raft consensus algorithm to keep the replicas consistent. A tablet is typically many gigabytes in size, and a node can hold up to 100 tablets at once. Kudu also has a dedicated component that manages all the metadata. This metadata describes the data stored on the nodes, helps limit further damage when failures occur, and keeps track of the changes made.
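The arithmetic behind majority-based replication, which Raft relies on, can be sketched as follows. The function names are hypothetical and the replica counts of 3 and 5 are illustrative values, not figures taken from this article.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


def is_committed(acks: int, replicas: int) -> bool:
    """In Raft, a write is committed once a majority of replicas have stored it."""
    return acks >= majority(replicas)


# Usage with illustrative replica counts.
for n in (3, 5):
    print(n, "replicas: majority", majority(n), "tolerates", tolerated_failures(n), "failure(s)")
```

With an odd number of replicas, adding two more copies buys tolerance for one more failure, which is why odd replica counts are the usual choice.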
What is the current status of the Kudu open source project?
Kudu is already well developed and ships with many features. However, it still needs some polishing, which will go faster if users suggest and contribute changes.
Kudu is completely open source and is released under the Apache Software License 2.0. The intention is also to submit the project to the Apache Software Foundation, so that it can be developed as an Apache Incubator project. This should let development progress faster and grow the audience further. Over time, Kudu’s development will be carried out publicly and transparently. Companies such as AtScale, Xiaomi, Intel and Splice Machine have joined hands to contribute to the development of Kudu. Kudu also has a large community whose members are already providing suggestions and contributions. So it is the people who are driving Kudu’s development forward.
How can Kudu complement HDFS/HBase?
Kudu isn’t meant to replace HDFS or HBase. It is designed to run alongside both of them and extend what they offer. HBase and HDFS still have features that make them more powerful than Kudu in certain use cases, and those use cases will continue to benefit more from these systems.
Before the news of Kudu’s development was made public, VentureBeat speculated about the consequences of large-scale adoption of Kudu and suggested that larger data warehousing systems such as PureData and Teradata would be at risk. Teradata, however, decided that it was better to use Kudu as a complement to its own system. It has an agreement with Hortonworks and has added Hadoop support to many of its software products.
Kudu is also expected to handle parallel processing easily and to offer a large in-memory database, unlike Vertica and VoltDB. However, Kudu is still working towards this, just as it is still being developed to surpass the abilities of HBase and HDFS.
Kudu’s design fits well with Hadoop’s architecture, so it is easy to integrate Kudu with Hadoop-based applications. Kudu’s Java client can be used to stream large amounts of data in real time, and the data can then be processed quickly by Spark or Impala. Data from HDFS and HBase can be joined with Kudu tables too, and Kudu can share data disks with HDFS DataNodes.
Features of the Kudu framework
The main features of the Kudu framework are as follows:
Extremely fast scans of the table’s columns: To scan as quickly as the best data formats, such as Parquet and ORCFile, Kudu stores data in columns. Scans at that speed are possible only when the columnar data is properly encoded.
Reliability of performance: The Kudu framework increases Hadoop’s overall reliability by closing many of the gaps present in Hadoop.
Low-latency random access: Kudu isn’t just a file format; it is a dynamic storage system that gives high-speed access to any row, column or cell.
Easy integration with Hadoop: Kudu can be easily integrated with Hadoop and its different components for more efficiency.
Completely open source: Kudu is an open source system with the Apache 2.0 license. It has a large community of developers from different companies and backgrounds, who update it regularly and provide suggestions for changes.
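To see why encoded columnar data makes scans fast, consider the sketch below. It uses run-length encoding as one common columnar encoding; the article does not say which encodings Kudu applies, so treat the technique and the function names here as assumptions chosen for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive repeats of a value into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[str, int]]) -> List[str]:
    """Expand (value, run_length) pairs back into the original column."""
    out: List[str] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def count_equal(runs: List[Tuple[str, int]], target: str) -> int:
    """Count matching rows by visiting each run once, without decoding the column."""
    return sum(length for value, length in runs if value == target)


# Usage: pull one column out of row-shaped data, then encode and scan it.
rows = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "ok"},
    {"id": 3, "status": "error"},
    {"id": 4, "status": "ok"},
]
status_column = [row["status"] for row in rows]
encoded = rle_encode(status_column)
print(encoded, count_equal(encoded, "ok"))
```

A scan over the encoded column touches one entry per run rather than one per row, and it never reads the other columns at all.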
How can Kudu change the Hadoop ecosystem?
Kudu was built to fit into Hadoop’s ecosystem and enhance its features. It integrates with some of Hadoop’s key components, such as MapReduce, HBase and HDFS. MapReduce jobs can either write data to Kudu tables or read data from them. The same can be done from Spark, where an integration layer makes Kudu tables accessible to Spark components such as Spark SQL and DataFrames. Kudu isn’t yet developed enough to replace these components, but it is expected to reach that point within a few years. Until then, the integration between Hadoop and Kudu is very useful and can fill major gaps in Hadoop’s ecosystem.
Where can Kudu be implemented?
Kudu can be implemented in a variety of places. Some examples are given below:
Streaming inputs in near real time: Kudu does a remarkable job where inputs need to be received as soon as possible. An example is a business where large amounts of dynamic data flood in from different sources and need to be made available quickly.
Time-series applications with varying access patterns: Kudu suits time-series applications because it makes it simple to set up tables and scan them. An example is a shopping mall, where old data has to be found quickly and processed to predict the future popularity of products.
Legacy systems: Companies that get data from various sources and store it in different systems will feel at home with Kudu. Kudu is extremely fast and integrates with Impala to process data across all the machines.
Predictive modelling: Data scientists who want a good platform for modelling can use Kudu. Because Kudu supports updates in place, the data behind a model can be changed as new information arrives. The scientist can then run and re-run the model repeatedly to see what happens.
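For the time-series case in the list above, one way to keep old data quick to find is to split rows into ranges by timestamp so that a scan only visits the ranges it needs. The sketch below is a hypothetical in-memory model of that idea; the class `RangePartitionedSeries` and its split points are assumptions for illustration and not part of Kudu.

```python
import bisect
from typing import List, Tuple

Point = Tuple[int, str]


class RangePartitionedSeries:
    """Hypothetical time series split into partitions at fixed timestamp boundaries."""

    def __init__(self, split_points: List[int]):
        self.split_points = sorted(split_points)
        # N split points produce N + 1 partitions.
        self.partitions: List[List[Point]] = [[] for _ in range(len(self.split_points) + 1)]

    def _index(self, timestamp: int) -> int:
        # Partition i holds timestamps in [split_points[i - 1], split_points[i]).
        return bisect.bisect_right(self.split_points, timestamp)

    def insert(self, timestamp: int, value: str) -> None:
        self.partitions[self._index(timestamp)].append((timestamp, value))

    def scan(self, start: int, end: int) -> Tuple[List[Point], int]:
        """Return rows with start <= timestamp <= end and the number of partitions read."""
        if start > end:
            return [], 0
        first, last = self._index(start), self._index(end)
        rows = [
            point
            for partition in self.partitions[first:last + 1]
            for point in partition
            if start <= point[0] <= end
        ]
        return sorted(rows), last - first + 1


# Usage: three partitions (before 100, 100 to 199, 200 and later).
series = RangePartitionedSeries([100, 200])
for ts, value in [(50, "a"), (150, "b"), (250, "c"), (120, "d")]:
    series.insert(ts, value)
rows, partitions_read = series.scan(110, 160)
```

A query for a narrow time window reads a single partition here, while the other partitions are skipped entirely.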
Even though Kudu is still in the development stage, it has enough potential to be a good add-on for standard Hadoop components like HDFS and HBase. It could change the Hadoop ecosystem by filling in its gaps and adding some more features. It is also fast and powerful, and can help in quickly analysing and storing large tables of data. However, there is still some work left to be done before it can be used more efficiently.