Overview: Apache HBase is one of the most popular non-relational databases built on top of Hadoop and HDFS (Hadoop Distributed File System). It is also known as the Hadoop database. As an Apache project, HBase is an open-source, versioned and distributed NoSQL database written in Java. It is modeled on Google’s Bigtable. Apache HBase is suitable for use cases where you need real-time, random read/write access to very large volumes of data (Big Data). As HBase runs on top of HDFS, its performance also depends on the underlying hardware. A cluster needs a sufficient number of nodes (at least 5 is a common rule of thumb) to perform well.
In this article, we will explore different aspects of HBase and its applicability.
What are the features of HBase NoSQL DB?
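Apache HBase is a column-oriented database in which columns are grouped into column families. Its schema is dynamic: only the column families are declared up front, and individual columns can be added at write time. It runs on top of HDFS and supports MapReduce jobs. HBase also supports other high-level languages for data processing.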
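To make the dynamic schema concrete, here is a minimal Python sketch of the logical data model as nested dictionaries. It is only an in-memory illustration, not the HBase client API; the family names, the `put` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

# Logical layout: row key -> column family -> column qualifier -> value.
# Only the column families are fixed up front (hypothetical names);
# qualifiers can differ from row to row.
COLUMN_FAMILIES = {"info", "stats"}
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    """Store one cell; reject families that were not declared for the table."""
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table[row_key][family][qualifier] = value


put("user#1", "info", "name", "Asha")
put("user#1", "info", "email", "asha@example.com")
put("user#2", "info", "name", "Ben")
put("user#2", "stats", "logins", 17)  # a column that user#1 does not have

# Print the columns each row actually holds, in row-key order.
for row_key in sorted(table):
    columns = sorted(
        f"{fam}:{qual}" for fam, quals in table[row_key].items() for qual in quals
    )
    print(row_key, columns)
```

The two rows end up with different sets of columns, and no table-wide change was needed to add `stats:logins` for one of them.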
Let us have a look at the main features of HBase, listed below.
Scalability: HBase scales linearly and modularly as nodes are added
Sharding: HBase shards tables automatically, and the sharding is configurable
Distributed storage: HBase keeps its data in distributed storage such as HDFS
Consistency: It supports strictly consistent read and write operations
Failover support: HBase supports automatic failover between RegionServers
API support: HBase provides Java APIs so clients can access it easily
MapReduce support: HBase supports MapReduce for parallel processing of large volumes of data
MapReduce base classes: HBase provides base classes for backing Hadoop MapReduce jobs with HBase tables
Real-time processing: It supports a block cache and Bloom filters, which speed up real-time queries
Apart from the above major features, HBase also supports RESTful web services, a JRuby-based shell, and metrics export through Ganglia and JMX. So, HBase has a very strong presence in the NoSQL database world.
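The Bloom filters mentioned in the feature list let a read skip files that cannot contain the requested row. The following is a minimal, generic Bloom filter sketch in Python, not HBase's own implementation; the class, its parameters (`num_bits`, `num_hashes`) and the sample row keys are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per (hypothetical) store file, built from the row keys it holds.
store_file_filter = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    store_file_filter.add(row_key)

print(store_file_filter.might_contain("row-002"))  # True: the file must be read
```

A `False` answer is definitive, so the file can be skipped without touching disk. A `True` answer only means the file may hold the row and still has to be checked.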
Is HBase a replacement of HDFS?
HBase is a NoSQL database that works on top of HDFS, so people sometimes think that HBase is a replacement or substitute for HDFS. But the two are fundamentally different. HDFS is a distributed storage layer that spans multiple commodity machines. It is the Hadoop file system and serves as generic storage for any type of Hadoop application. HBase, on the other hand, is a non-relational database that uses HDFS as the storage for its data. The relationship is comparable to that between a relational database and the local file system where it keeps its files. So we can conclude that HBase is not a replacement for HDFS; the two work together and complement each other.
How does HBase work?
HBase scales linearly, and to make that possible every table must have a row key, which plays the role of a primary key. Rows are kept sorted by row key, and the key space is divided into contiguous ranges, each of which is assigned to a region. These regions are served by RegionServers, which spreads the load across the cluster. HBase shards data automatically, so manual intervention is not required.
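The mapping from a row key to its region can be sketched as a binary search over region start keys. This is a simplified Python illustration with hypothetical start keys and server names; real HBase compares row keys as byte arrays and stores region locations in a catalog table rather than in a list.

```python
import bisect

# Hypothetical regions, each covering [start_key, next_start_key).
# The first region starts at the empty key, so every row key maps somewhere.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1.example.com", "rs2.example.com",
                  "rs3.example.com", "rs1.example.com"]


def locate_region(row_key):
    """Return (region index, server) for the region whose range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


for key in ("apple", "grape", "zebra"):
    print(key, locate_region(key))
```

Because the ranges are contiguous and sorted, a lookup only needs the list of start keys, and one server can host several regions (as `rs1.example.com` does here).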
After HBase is deployed, ZooKeeper and the HMaster servers are configured to provide cluster topology information to HBase clients. Client applications connect to these services and obtain the list of RegionServers, their regions and the key range of each region. This lets a client work out exactly where a row lives and connect to the right RegionServer directly. RegionServers also keep recent writes in memory (the MemStore) and cache frequently read data (the block cache), which improves performance.
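The MemStore mentioned above holds recent writes in memory until they are flushed to files on HDFS. The toy Python class below sketches that idea under simplifying assumptions: the `RegionStore` name and the cell-count `flush_threshold` are hypothetical, and the sketch leaves out the write-ahead log, compactions and the block cache entirely.

```python
class RegionStore:
    """Toy write path: buffer writes in memory, flush them as sorted files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical, counted in cells
        self.memstore = {}      # recent writes, held in memory
        self.store_files = []   # immutable sorted snapshots, oldest first

    def put(self, row_key, value):
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered cells out in row-key order and start a new buffer.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        # Newest data wins: check memory first, then files from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for store_file in reversed(self.store_files):
            for key, value in store_file:
                if key == row_key:
                    return value
        return None


store = RegionStore()
store.put("b", 1)
store.put("a", 2)
store.put("c", 3)   # third cell reaches the threshold and triggers a flush
store.put("a", 9)   # newer value for "a" stays in memory

print(store.get("a"))          # 9
print(store.get("b"))          # 1
print(len(store.store_files))  # 1
```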
What are the supporting services?
If we decide to select HBase as the NoSQL database for our application, we must also remember the supporting services it requires. HBase alone does not complete the picture. The most important supporting service is coordination in the distributed environment, and ZooKeeper is the coordination service used with HBase. The other important area is networking. Network services like NTP and DNS should be in place so that the nodes stay synchronized and can find each other. NTP (Network Time Protocol) is a network protocol for synchronizing clocks between connected systems. As HBase is distributed across nodes, clock synchronization is very important when the nodes communicate with each other. DNS (Domain Name System) lets the nodes resolve each other's host names. Together, DNS and NTP ensure the smooth and efficient functioning of HBase.
Monitoring is another important service when deploying HBase. Every node should be monitored for CPU usage, latency, I/O activity and bandwidth.
When should you use HBase?
After going through the above sections, we have some idea about HBase. We also know the supporting services and the key considerations for an HBase deployment. As a NoSQL DB, HBase offers a lot of useful functionality, but it is still not a ‘Fit for All’ solution. Following are some of the key areas to consider before finalizing HBase for your application.
Data volume: The volume of data is the most common point to consider. HBase pays off when you have very large volumes of data, up to petabytes, to be processed in a distributed environment. A small amount of data will be stored and processed on a single node, leaving the other nodes idle, which is a misuse of the technology.
Application types: HBase is not suitable for transactional applications, large-volume batch MapReduce jobs, relational analytics etc. It is preferred when you have a variable schema, where rows differ slightly from one another. It is also suitable when you access your stored data mainly by key.
Hardware environment: HBase runs on top of HDFS, and HDFS works efficiently with a large number of nodes (minimum 5). So, if you have good hardware support, HBase can be a good selection.
No requirement of relational features: Your application should not require RDBMS features like transactions, triggers, complex queries, complex joins etc. If you can build your application without these features, then go for HBase.
Quick access to data: If you need random, real-time access to your data, then HBase is a suitable candidate. It is also a good fit for storing large tables with multi-structured data. Because cells are versioned, it gives ‘flashback’ support to queries, which makes it suitable for fetching data as it was at a particular point in time.
Apart from the above points, HBase is also suitable when you need fault-tolerant, fast and usable data management in a non-relational environment.
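The ‘flashback’ behaviour mentioned in the list above comes from HBase keeping timestamped versions of a cell. Here is a minimal Python sketch of a point-in-time read over a single cell; the helper names `put_version` and `get_version`, the integer timestamps and the sample values are hypothetical, and in HBase the number of versions retained is configurable per column family.

```python
# Versions of one cell as (timestamp, value) pairs, kept newest first.
cell_versions = []


def put_version(timestamp, value):
    """Add a version and keep the list ordered newest first."""
    cell_versions.append((timestamp, value))
    cell_versions.sort(key=lambda version: version[0], reverse=True)


def get_version(as_of=None):
    """Return the newest value, or the newest one written at or before as_of."""
    for timestamp, value in cell_versions:
        if as_of is None or timestamp <= as_of:
            return value
    return None  # no version existed yet at that time


put_version(100, "draft")
put_version(200, "reviewed")
put_version(300, "published")

print(get_version())           # published
print(get_version(as_of=250))  # reviewed
print(get_version(as_of=50))   # None
```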
What are the recent improvements in HBase?
Following are some of the recent improvements in HBase.
- Improved high availability
- HBase and YARN integration
- Block cache compression
- Support for data types
- Support for rolling upgrades
Some use cases
There are a lot of real-life implementations of HBase. Some of the important use cases are:
- Use of HBase by Mozilla: Mozilla stores its crash data in HBase.
- Use of HBase by Facebook: Facebook uses HBase to store real-time messages.
Throughout this article, we have discussed different features of HBase, how it works and where it can be applied. We have also looked at the recent improvements and some of the use cases. In short, we can conclude that HBase is a key-value NoSQL database and a good fit for real-time queries. HBase, together with its own components (such as HMaster and the RegionServers) and supporting services like ZooKeeper, can be a complete solution for NoSQL deployments. But again, before finalizing, we should evaluate it against the application's requirements.