TechAlpine – The Technology world

What is Apache HBase and when should you use it?

Overview: Apache HBase is often described as the Hadoop database. It is a distributed, non-relational, open source database written in Java. It is modeled after Google's Bigtable and runs on top of HDFS (Hadoop Distributed File System). Use Apache HBase when you need random, real-time read/write access to a large volume of data. HBase is a suitable candidate when you have hundreds of millions or billions of rows and enough hardware to support them. Because HBase stores its data in HDFS, and HDFS performs well only with a minimum of about 5 data nodes, a very small cluster is not a good fit. In short, HBase is a distributed data store suited to random access over very large tables; it is not a drop-in replacement for a relational database or data warehouse.

In this article I will explain the details with architectural concepts.

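Introduction: Apache HBase is a NoSQL, column-oriented database management system that runs on top of HDFS. HBase does not natively support a structured query language like SQL. HBase itself is written in Java and its primary client API is Java. Applications do not have to be MapReduce jobs, although MapReduce jobs can read from and write to HBase tables. Non-Java clients can reach HBase through its REST and Thrift gateways (older releases also included an Avro gateway). Some of the important features of HBase are listed below.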

• HBase supports automatic sharding: tables are split into regions that are distributed across the cluster.
• HBase uses HDFS as its distributed storage.
• HBase supports MapReduce for parallel processing of large volumes of data.
• HBase provides Java client APIs.
• HBase supports strongly consistent read and write operations, which makes it suitable for tasks such as high-speed counter aggregation.
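The automatic sharding mentioned in the list above can be pictured with a small model. HBase splits a table into regions, each covering a contiguous range of row keys, and each region is served by one region server. The sketch below is plain Python rather than HBase client code, and the region start keys and server names are made up for illustration.

```python
import bisect

# Hypothetical region boundaries: each region starts at the given key
# (inclusive) and ends just before the next start key. The first region
# starts at the empty key, so every possible row key belongs somewhere.
REGION_START_KEYS = ["", "g", "n", "t"]

# Hypothetical assignment of regions to region servers (names are made up).
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key strictly greater than row_key;
    # the region immediately before that position owns the key.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key: str) -> str:
    """Return the (hypothetical) server that hosts the region for row_key."""
    return REGION_SERVERS[locate_region(row_key)]


if __name__ == "__main__":
    for key in ["apple", "g", "mango", "zebra"]:
        print(key, "-> region", locate_region(key), "on", locate_server(key))
```

In a real cluster the boundaries are not fixed as they are here: regions are split as they grow and may be moved between servers, which is what makes the sharding automatic.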

Difference between HBase and HDFS: We have said that HBase runs on top of HDFS, so you might assume that HDFS and HBase are similar. They are not. HDFS is a distributed file system suited to storing very large files and reading them sequentially; it does not support fast lookup of individual records. HBase sits on top of HDFS and adds fast record lookup and update for large tables.

When should you use HBase?
HBase is a typical NoSQL, column-oriented data store. Whether to choose a NoSQL database or an RDBMS depends on the requirements of the application, so first understand the requirements clearly and then select the database. Selecting a NoSQL database without proper analysis can cause trouble later, and it is also a misuse of technology and resources. The following points should be considered when selecting a NoSQL database like HBase.

Volume: The volume of data is the first criterion for selecting a NoSQL database. You should have a very large amount of data (hundreds of millions or billions of rows) to store and process. If you only have a few thousand or a few million rows, a traditional RDBMS is usually the better fit. If you choose HBase for a small amount of data, all of it may end up on a single node while the other nodes in the cluster sit idle.

Hardware support: HDFS performs efficiently when there are at least 5 data nodes. Because HBase is built on HDFS, you need sufficient hardware before deploying HBase.

No need for RDBMS features: Make sure that your application does not require the extra features provided by a typical RDBMS. Features such as typed columns, secondary indexes, multi-row transactions, complex queries and triggers are not provided by HBase in the way an RDBMS provides them. This is another important criterion for selection.

HBase Design Concepts:
The design concepts behind HBase are similar to those of HDFS and the MapReduce framework. Because all of them work in a distributed environment, the general design is a master-slave architecture. HDFS has a NameNode and DataNode slaves; classic MapReduce has a JobTracker and TaskTracker slaves. Similarly, HBase has the following master-slave architecture.

  • The master node manages the cluster.
  • Region servers store the table data and serve reads and writes on it.

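As the master node is the main controller, losing it disrupts cluster management tasks such as assigning regions to region servers. For this reason a production cluster typically runs one or more standby masters that can take over if the active master fails.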

HBase Views: HBase presents a tabular view of the stored data. The main concept is the column family. An HBase table is made of rows and columns, and each column belongs to a column family. The row key is the primary key for table access. The row key can be any value, and rows are kept sorted by row key. The following two views describe these concepts.
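Because rows are kept sorted by row key, a lookup by key and a scan over a key range are both cheap. The following is a minimal sketch of that idea in plain Python, not HBase client code. The class name `SortedRows` and its methods are hypothetical, and Python strings stand in for the byte arrays that HBase actually uses as row keys.

```python
import bisect


class SortedRows:
    """Toy model of a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> row contents

    def put(self, row_key, value):
        # Insert the key at its sorted position the first time it is seen.
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        # Point lookup by primary key; None if the row does not exist.
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        # Return rows with start_key <= key < stop_key, in key order.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


if __name__ == "__main__":
    rows = SortedRows()
    for key in ["Rowkey3", "Rowkey1", "Rowkey5", "Rowkey2"]:
        rows.put(key, "data for " + key)
    print(rows.scan("Rowkey2", "Rowkey5"))
```

The practical consequence is that row key design matters: rows that should be read together should have keys that sort next to each other.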

Conceptual View: In this section I will explain the conceptual view with an example. A table contains column families, and column families contain columns. By convention a column name is made of two parts: the column family name, used as a prefix, and the column name (qualifier) within that family. The colon character (:) separates the column family from the column. As an example, take a table named ‘hbasetable’ with two column families, ‘colfamily1’ and ‘colfamily2’. ‘colfamily1’ has two columns, ‘name’ and ‘address’. ‘colfamily2’ has one column, ‘telno’. The structure is shown below.

Table ‘hbasetable’

colfamily1: name = “Ricardo”

colfamily1: address = “MA, USA”

colfamily2: telno = “2235678”

The tabular view will look like below.

Row Key   | Time Stamp | ColumnFamily colfamily1         | ColumnFamily colfamily2
“Rowkey1” | T1         |                                 | colfamily2: telno = “2235678”
“Rowkey2” | T2         |                                 | colfamily2: telno = “9995678”
“Rowkey3” | T3         |                                 | colfamily2: telno = “8896578”
“Rowkey4” | T4         | colfamily1: name = “Ricardo”    |
“Rowkey5” | T5         | colfamily1: address = “MA, USA” |

Table 1: Tabular view of ‘hbasetable’
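One way to read the conceptual view is as a nested map: row key, then column (family:qualifier), then time stamp, then value. The sketch below models the table above in plain Python under that assumption. It is not HBase client code; the `put` and `get` helpers are hypothetical, and the integers 1 to 5 stand in for the time stamps T1 to T5.

```python
from collections import defaultdict

# row key -> column ("family:qualifier") -> {time stamp: value}
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, column, timestamp, value):
    """Write one version of one cell."""
    table[row_key][column][timestamp] = value


def get(row_key, column):
    """Return the newest value of a cell, or None if it was never written."""
    versions = table.get(row_key, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]


# The rows of Table 1; time stamps T1..T5 are represented as 1..5.
put("Rowkey1", "colfamily2:telno", 1, "2235678")
put("Rowkey2", "colfamily2:telno", 2, "9995678")
put("Rowkey3", "colfamily2:telno", 3, "8896578")
put("Rowkey4", "colfamily1:name", 4, "Ricardo")
put("Rowkey5", "colfamily1:address", 5, "MA, USA")

if __name__ == "__main__":
    print(get("Rowkey4", "colfamily1:name"))
    print(get("Rowkey4", "colfamily2:telno"))  # empty cell in the table
```

Note that the empty cells of the table simply have no entry in the map; nothing is stored for them.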

Physical View: We have already discussed the conceptual view of an HBase table and its contents. The physical view is a bit different. Physically, HBase tables are stored on a per column family basis. New columns can therefore be added to an existing column family at any time, without declaring them in advance. This flexibility fits well with the way HBase scales out by adding nodes.

The following tables show the two column families as they are stored.

Row Key   | Time Stamp | ColumnFamily colfamily1
“Rowkey4” | T4         | colfamily1: name = “Ricardo”
“Rowkey5” | T5         | colfamily1: address = “MA, USA”

Table 2: Showing colfamily1

Row Key   | Time Stamp | ColumnFamily colfamily2
“Rowkey1” | T1         | colfamily2: telno = “2235678”
“Rowkey2” | T2         | colfamily2: telno = “9995678”
“Rowkey3” | T3         | colfamily2: telno = “8896578”

Table 3: Showing colfamily2

Please note that the empty cells displayed in the conceptual view are not actually stored; only cells that hold a value are written to the storage of their column family. So a query for ‘colfamily1’ at time stamp ‘T1’ returns nothing, and the same is true for ‘colfamily2’ at time stamp ‘T4’. The versions of a cell are stored in descending order of time stamp, so the most recent value of a column is returned when a query does not specify a time stamp.
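The physical layout described above can be sketched in a few lines of plain Python. This is a simplified model, not HBase's actual file format or client API: each column family gets its own sorted list of cells, the time stamp is negated so that newer versions sort first, and a read with a time stamp matches that exact version only. The second ‘telno’ value for ‘Rowkey1’ at time stamp 6 is invented here purely to show versioning.

```python
import bisect

# One store per column family. Each store holds tuples of
# (row_key, qualifier, -timestamp, value), kept sorted, so that for the
# same cell the newest version comes first.
stores = {"colfamily1": [], "colfamily2": []}


def put(family, row_key, qualifier, timestamp, value):
    """Store one version of a cell in its column family's store."""
    bisect.insort(stores[family], (row_key, qualifier, -timestamp, value))


def get(family, row_key, qualifier, timestamp=None):
    """Return the newest version, or the exact version if timestamp is given."""
    for r, q, neg_ts, value in stores[family]:
        if (r, q) == (row_key, qualifier):
            if timestamp is None or -neg_ts == timestamp:
                return value
    return None  # nothing was ever stored for this cell or version


# Cells from Tables 2 and 3; time stamps T1, T4, T5 are written as 1, 4, 5.
put("colfamily2", "Rowkey1", "telno", 1, "2235678")
put("colfamily1", "Rowkey4", "name", 4, "Ricardo")
put("colfamily1", "Rowkey5", "address", 5, "MA, USA")
# Hypothetical newer version of the same cell, added only for illustration.
put("colfamily2", "Rowkey1", "telno", 6, "1112222")

if __name__ == "__main__":
    print(get("colfamily2", "Rowkey1", "telno"))               # newest version
    print(get("colfamily2", "Rowkey1", "telno", timestamp=1))  # older version
    print(get("colfamily1", "Rowkey1", "name"))                # never stored
```

Because each family has its own store, a read that touches only ‘colfamily1’ never has to look at the data of ‘colfamily2’.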

Conclusion: To sum up, HBase is an open source, distributed NoSQL database suitable for storing and processing very large amounts of data. It began as part of the Apache Hadoop project and stores its data in HDFS. HBase integrates with MapReduce for parallel processing, but ordinary reads and writes go directly to the region servers and are not MapReduce tasks. The basic concept is the same as Google’s Bigtable. A NoSQL database should be selected carefully. RDBMS design and NoSQL design are completely different, so a relational schema cannot simply be ported to HBase as it is; the design has to be reworked when moving from an RDBMS to HBase.
