How can you manage large volumes of data using the Apache Cassandra NoSQL database?

Overview: Apache Cassandra is one of the most popular and scalable open source NoSQL databases. It is well suited to managing huge volumes of unstructured, semi-structured and structured data across multiple data centers and cloud environments. Cassandra delivers high scalability and availability across many commodity servers without compromising performance. With its distributed design there is no single point of failure, and it provides a powerful data model for maximum flexibility and fast response times. Linear scalability combined with fault tolerance on commodity hardware or cloud infrastructure makes it a strong choice for critical data.

Introduction: Relational databases are very good at solving certain types of data storage problems. However, an RDBMS is designed around normalized data and joins, and this creates problems when scaling up to large volumes of data, because joins become expensive once the data is spread across many servers. So we need to find a way to get rid of the joins, which means de-normalizing the data. De-normalizing in turn means maintaining multiple copies of the data, and in a relational system it does considerable damage to the design, both in the database and in the application. Given this, the solutions offered by NoSQL seem less radical and less scary than we may have thought. The design goals of a NoSQL database must be clearly understood before it is used in any application.
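To make the trade-off concrete, here is a minimal Python sketch of the same data stored in a normalized and a de-normalized layout. It is only an illustration with plain dictionaries, not Cassandra code: the customer and order records, and the function names `orders_for_customer_join` and `rename_customer`, are made up for this example.

```python
# Normalized layout: each fact is stored once, so reading needs a join.
customers = {1: {"name": "Asha"}, 2: {"name": "Ben"}}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 2, "total": 40.0},
    {"order_id": 102, "customer_id": 1, "total": 15.5},
]


def orders_for_customer_join(customer_id):
    """Join orders to customers at read time."""
    name = customers[customer_id]["name"]
    return [
        {"order_id": o["order_id"], "customer_name": name, "total": o["total"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]


# De-normalized layout: rows are grouped by the query key and carry a copy
# of the customer name, so a read is a single lookup with no join.
orders_by_customer = {}
for o in orders:
    row = {
        "order_id": o["order_id"],
        "customer_name": customers[o["customer_id"]]["name"],
        "total": o["total"],
    }
    orders_by_customer.setdefault(o["customer_id"], []).append(row)


def rename_customer(customer_id, new_name):
    """The cost of de-normalizing: every copy of the name must be rewritten."""
    customers[customer_id]["name"] = new_name
    for row in orders_by_customer.get(customer_id, []):
        row["customer_name"] = new_name


# Both layouts answer the same question; only the read path differs.
print(orders_by_customer[1] == orders_for_customer_join(1))
rename_customer(1, "Asha K.")
print(orders_by_customer[1])
```

The read becomes a single lookup, but a change to one fact now has to touch every copy, which is the maintenance cost mentioned above.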

Design goals of Cassandra NoSQL database: The design goals of a NoSQL database are completely different from those of a relational database, so the choice between a NoSQL DB and an RDBMS depends on the type of application and its requirements. ACID transactions provide the strong consistency model on which traditionally designed web applications rely. But scalability comes at a cost, because it conflicts with some of the rules followed in RDBMS design. Promoting availability over consistency is therefore one of the key design factors for NoSQL databases. The common design goals of Cassandra are listed below.

High performance
Horizontal scalability
Simplicity
Schema flexibility

Cassandra architecture to manage large data volume: NoSQL databases are typically distributed across a number of commodity nodes. Cassandra is also distributed across a number of nodes, and it follows a ‘masterless’ architecture: all nodes are the same, and no single node controls the others. Cassandra automatically distributes data across all the nodes, which together form a ‘ring’, also known as the database cluster. Because the data is automatically and transparently partitioned across the cluster, developers do not need to do anything programmatically. Another important feature of the Cassandra architecture is its support for built-in, customizable replication. Redundant copies of the data are stored on multiple nodes in the Cassandra ring, so if any node fails, the same data can be retrieved from other nodes that hold a replica. Replication can be configured in the following ways.

Across one data center
Across multiple data centers
Across multiple cloud infrastructures
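The idea of a masterless ring with replication can be illustrated with a small simulation. The Python sketch below is not Cassandra's actual implementation: it assumes an MD5-based token, a single token per node, and a simple "walk clockwise to the next nodes" replica placement. The names `Ring`, `token_for`, `replicas_for` and `read`, as well as the node names, are hypothetical.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Map a string to an integer token (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy masterless ring: every node owns one token, no node is special."""

    def __init__(self, nodes, replication_factor=3):
        nodes = list(nodes)
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the list can be walked like a ring.
        self._ring = sorted((token_for(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas_for(self, key: str):
        """Return the nodes holding copies of `key`, walking clockwise."""
        start = bisect.bisect_left(self._tokens, token_for(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]

    def read(self, key: str, down=()):
        """Return the first live replica for `key`, or None if all are down."""
        for node in self.replicas_for(key):
            if node not in down:
                return node
        return None


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas_for("user:42")
print(owners)
# If the first replica fails, the read is served by the next one.
print(ring.read("user:42", down={owners[0]}))
```

Any node can compute the same replica list from the key alone, which is why no coordinating master is needed in this toy model.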

Another architectural feature is the support for linear scalability: capacity can be increased simply by adding new nodes. For example, if 2 nodes can handle 1000 transactions/sec, then 4 nodes will support 2000 transactions/sec, and so on. The following picture shows the linear scalability of a Cassandra ring.

Cassandra Ring
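The arithmetic behind linear scalability can be written down in a few lines. The following Python sketch assumes ideal linear scaling from the 2-node, 1000 transactions/sec example; real clusters will deviate from this depending on workload and hardware. The function names `estimated_throughput` and `nodes_needed` are hypothetical.

```python
import math


def estimated_throughput(nodes: int, baseline_nodes: int = 2,
                         baseline_tps: float = 1000.0) -> float:
    """Ideal linear scaling: throughput grows in proportion to node count."""
    per_node = baseline_tps / baseline_nodes
    return nodes * per_node


def nodes_needed(target_tps: float, baseline_nodes: int = 2,
                 baseline_tps: float = 1000.0) -> int:
    """Smallest node count whose ideal throughput reaches target_tps."""
    per_node = baseline_tps / baseline_nodes
    return math.ceil(target_tps / per_node)


for n in (2, 4, 8):
    print(n, "nodes ->", estimated_throughput(n), "transactions/sec")
print("nodes for 5000 transactions/sec:", nodes_needed(5000))
```

Under this idealized model, capacity planning reduces to dividing the target load by the per-node throughput and rounding up.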

Accessing large volume of data: The first thing that comes to mind when developing a database-driven application is the availability of client libraries. For RDBMS products the available libraries are straightforward. For example, JDBC is the standard database access API for Java-based applications, and there is normally a single JDBC driver vendor for a particular database product. Cassandra, on the other hand, has approximately nine different clients for Java application development. More importantly, these clients offer different flavors of data management: some provide object relational mapping APIs, some offer CQL-based support, and so on. This flexibility in accessing the NoSQL DB is another major advantage for application development, because developers can choose the type of access that suits their requirements.

Large volumes of data in Cassandra can be accessed and managed through APIs that follow an RPC style. Cassandra also provides a basic query language called CQL, which is similar to SQL to some extent. Even so, the application developer must have a sound knowledge of the storage engine and how it works.

Standard use cases for Cassandra NoSQL DB: As already discussed, the standard use cases for Cassandra are different from those of traditional RDBMS applications. Some standard use cases follow.

Applications handling very large data volume
Applications requiring high scalability and availability
Applications with high reliability requirements for data storage
Applications with a dynamic data model that is expected to change significantly over time
Applications distributed over different data centers

Downloading and Installing Cassandra: Let us now discuss how to download and install Cassandra NoSQL DB. The download and installation will take some time.

Apache Cassandra can be downloaded from http://cassandra.apache.org . The binary distribution is named apache-cassandra-<VERSION>-bin.tar.gz. The easiest way to install Cassandra is described in the following steps:

Download the binary distribution from the above website
Extract it using any archive utility that can handle .tar.gz files
Once extracted, you should get the following directories:
- bin – this contains the executables to run Cassandra and the command line interface client.
- conf – this contains files used to configure Cassandra
- interface – this contains the client API definition, written in the Thrift syntax, which provides an easy way to generate clients. If you want to see all of the operations that Cassandra supports, open the definition file in a regular text editor. Cassandra supports clients for Java, C++, PHP, Ruby, Python, Perl, and C# through this interface.
- lib – this contains the external libraries which are required to run Cassandra.
- javadoc – This contains the documentation in html format for Cassandra.
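If you prefer to script the extraction step, the following Python sketch uses the standard `tarfile` module to unpack the archive and then checks for the directories listed above. The function names `extract_distribution` and `missing_directories` are hypothetical, and the sketch assumes the archive contains a single top-level directory. Only extract archives that you trust.

```python
import tarfile
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = ("bin", "conf", "interface", "lib", "javadoc")


def extract_distribution(archive, target_dir):
    """Extract a .tar.gz archive and return the top-level directory it created."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member in the archive.
        roots = {
            Path(member.name).parts[0]
            for member in tar.getmembers()
            if Path(member.name).parts
        }
        tar.extractall(target)
    if len(roots) != 1:
        raise ValueError("expected exactly one top-level directory in the archive")
    return target / roots.pop()


def missing_directories(cassandra_home):
    """Return the expected directories that are absent under cassandra_home."""
    home = Path(cassandra_home)
    return [name for name in EXPECTED_DIRS if not (home / name).is_dir()]
```

For example, calling `missing_directories(extract_distribution("apache-cassandra-<VERSION>-bin.tar.gz", "."))` with the real file name substituted should return an empty list when all five directories are present.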

Start the Cassandra NoSQL server: Follow the instructions below to start the Cassandra server.

To start the Cassandra server on an OS such as Linux or Windows, first open a command prompt or terminal window. Go to <cassandra-directory>, the directory where you unpacked Cassandra, and run the following command, which launches the script in its bin subdirectory. If the installation was clean, you will see log statements like these:

Listing 1: Starting the Cassandra server

utpalb@Cassandraserver$ bin/cassandra -f
INFO 13:23:22,367 DiskAccessMode 'auto' determined to be standard, indexAccessMode is standard
INFO 13:23:22,475 Couldn't detect any schema definitions in local storage.
INFO 13:23:22,476 Found table data in data directories. Consider using JMX to call org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
INFO 13:23:22,497 Cassandra version: 0.7.0-beta1
INFO 13:23:22,497 Thrift API version: 10.0.0
INFO 13:23:22,498 Saved Token not found. Using qFABQw5XJMvs47lg
INFO 13:23:22,498 Saved ClusterName not found. Using Test Cluster
INFO 13:23:22,502 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1282508602502.log
INFO 13:23:22,507 switching in a fresh Memtable for LocationInfo at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282508602502.log', position=276)
INFO 13:23:22,510 Enqueuing flush of Memtable-LocationInfo@29857804(178 bytes, 4 operations)
INFO 13:23:22,511 Writing Memtable-LocationInfo@29857804(178 bytes, 4 operations)
INFO 13:23:22,691 Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-1-Data.db
INFO 13:23:22,701 Starting up server gossip
INFO 13:23:22,750 Binding thrift service to localhost/127.0.0.1:9160
INFO 13:23:22,752 Using TFramedTransport with a max frame size of 15728640 bytes.
INFO 13:23:22,753 Listening for thrift clients...
INFO 13:23:22,792 mx4j successfuly loaded HttpAdaptor version 3.0.2 started on port 8081

The -f option tells Cassandra to stay in the foreground instead of running as a background process. As a result, all of the server logs are printed to standard output and you can see them in your terminal window, which is useful for testing.
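Because the logs go to standard output, a test script can read them to decide when the server is ready. The Python sketch below is one possible way to do that. It assumes the log line layout and message texts shown in Listing 1, which may differ in other Cassandra versions, and the names `parse_line` and `startup_summary` are hypothetical.

```python
import re

# Layout assumed from Listing 1: LEVEL HH:MM:SS,mmm message
LOG_LINE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+(?P<message>.*)$"
)
READY_MARKER = "Listening for thrift clients"


def parse_line(line):
    """Split one log line into level, time and message, or return None."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def startup_summary(lines):
    """Collect the version, the Thrift address and the readiness flag."""
    summary = {"version": None, "thrift_address": None, "ready": False}
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # not a log line, e.g. the shell prompt
        message = entry["message"]
        if message.startswith("Cassandra version:"):
            summary["version"] = message.split(":", 1)[1].strip()
        elif message.startswith("Binding thrift service to"):
            summary["thrift_address"] = message.rsplit(" ", 1)[1]
        elif message.startswith(READY_MARKER):
            summary["ready"] = True
    return summary


sample = [
    "utpalb@Cassandraserver$ bin/cassandra -f",
    "INFO 13:23:22,497 Cassandra version: 0.7.0-beta1",
    "INFO 13:23:22,750 Binding thrift service to localhost/127.0.0.1:9160",
    "INFO 13:23:22,753 Listening for thrift clients...",
]
print(startup_summary(sample))
```

In a real test setup the lines would come from the standard output of the running server process rather than from a fixed list.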

Summary:

Let us conclude our discussion with the following points:

Apache Cassandra is a scalable NoSQL database
It can be downloaded and installed from the Apache website
Cassandra is an ideal database for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud.
Cassandra supports linear scalability and high performance across multiple commodity servers with no single point of failure, and provides a powerful dynamic data model designed for maximum flexibility and fast response time.

TechAlpine – All About Technology

www.techalpine.com