TechAlpine – The Technology world

What is Apache Shark?

Overview: Apache Shark is a distributed query engine developed by the open source community. It is mainly used to query data stored in Hadoop, and it gives Hive users better performance and richer analytical capabilities.

In this document, I will talk about Apache Shark and its features in detail.

Introduction: Apache Shark is a data warehouse system that runs on Apache Spark and is designed to be compatible with Apache Hive. Shark can execute HiveQL queries up to 100 times faster than Hive without any change to the existing queries. It supports most of Hive's features, including its query language, metastore, serialization formats, and user-defined functions, which makes integration with existing Hive deployments easier.

Important features of Apache Shark:

Apache Shark comes with the following important features –

  • Faster Execution Engine – Apache Shark is built on top of Apache Spark, a parallel data execution engine. Because it avoids the overhead of Hadoop MapReduce, Shark is faster than its competitors even when the data is on disk. It can answer complex queries with sub-second latency.
  • Column-wise Memory Store – Analytical queries usually focus on a small subset of the data, for example a particular time range or location. Such queries touch only a few dimension tables or a small portion of the fact tables, so they exhibit temporal locality. This makes it possible to fit the working set into the cluster's memory.

Users can exploit this temporal locality by storing the working set in the cluster's memory or, in database terms, by creating in-memory materialized views. Commonly used data types are cached in a columnar format (for example as primitive arrays), which is efficient for both storage and garbage collection. This gives maximum performance, because the data is read from the in-memory tables and not from disk.

Setup and execute locally:

Prerequisite – Before setting up Shark on your computer, make sure you have the following installed on your system –

The binary distribution of Shark can be downloaded from the AMPLab page on GitHub. The binary package contains two folders –

  • shark-0.8.0
  • hive-0.9.0-shark-0.8.0-bin

We need to set the following environment variables for the setup –

  • JAVA_HOME
  • HIVE_HOME
  • SCALA_HOME

Shark comes with a template environment file, shark-env.sh.template. Make a copy of this template in shark-0.8.0/conf and name it shark-env.sh. Once the environment variables are set, we need to create the default Hive warehouse directory, which is where Hive stores the table data for native tables. Make sure this directory is owned by the same user that is doing the Shark setup.

Our Shark setup is now ready. Run the following command –

Listing 1 – Starting the Shark command line interface
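The copy-and-create steps above can be sketched as a small shell helper. This is only a minimal sketch, not something shipped with Shark: the function name prepare_shark_conf and its two arguments are hypothetical, and the warehouse path should be whatever your Hive configuration expects (Hive's default is /user/hive/warehouse).

```shell
# Hypothetical helper: not part of the Shark distribution.
# $1 = Shark installation directory (the unpacked shark-0.8.0 folder)
# $2 = Hive warehouse directory to create
prepare_shark_conf() {
    shark_home="$1"
    warehouse_dir="$2"

    # Copy the template so the original stays available for reference.
    cp "$shark_home/conf/shark-env.sh.template" \
       "$shark_home/conf/shark-env.sh" || return 1

    # Create the warehouse directory. It is owned by the current user,
    # which should be the same user that runs Shark.
    mkdir -p "$warehouse_dir" || return 1
}
```

A call might look like prepare_shark_conf "$HOME/shark-0.8.0" /user/hive/warehouse. Creating a directory directly under / may require elevated privileges, in which case the ownership has to be adjusted afterwards.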

./bin/shark

To verify that Shark is up and running, let us run the following example, which creates a table and loads some sample data –

Listing 2 – Sample code to create a simple table and then load some data

CREATE TABLE SOURCE_MAP (key INT, value STRING);

LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE SOURCE_MAP;

SELECT COUNT(1) FROM SOURCE_MAP;

CREATE TABLE SOURCE_MAP_cached AS SELECT * FROM SOURCE_MAP;

SELECT COUNT(1) FROM SOURCE_MAP_cached;

In addition to the shark command above, there are several other executables –

  • bin/shark-withdebug – Runs the Shark command line interface with debug-level logs printed on the console.
  • bin/shark-withinfo – Runs the Shark command line interface with info-level logs printed on the console.

The steps above set up Shark on a single node. To run Shark on a cluster, follow the steps below.

Prerequisite – Before setting up Shark on a cluster, make sure you have the following installed on your system –

Unlike earlier versions of Shark and Spark, the latest version no longer requires Apache Mesos.

First, let us make some changes in the Spark environment –

  • Slaves file – The Spark slaves file, spark-0.8.0/conf/slaves, needs to list the host name of each slave, one host name per line.
  • Spark env file – The Spark env file, spark-0.8.0/conf/spark-env.sh, needs to have the following entries –
    • SCALA_HOME – as explained above.
    • SPARK_WORKER_MEMORY – The maximum amount of memory that Spark can use on each node. When setting this parameter, leave at least 1 GB of memory free so that the OS can function properly.

Now let us make the corresponding changes in the Shark environment –

As mentioned above, download the binary distribution of Shark from the AMPLab page on GitHub. The binary package contains two folders –
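The two Spark configuration changes above can be sketched in shell as follows. This is a rough sketch under assumptions: the helper name write_spark_conf, the argument order, and the host names in the usage note are all hypothetical, and the helper overwrites any existing slaves file rather than merging into it.

```shell
# Hypothetical helper: writes the Spark config entries described above.
# $1 = Spark directory, $2 = SCALA_HOME value,
# $3 = worker memory (for example 15g), remaining arguments = slave host names.
write_spark_conf() {
    spark_home="$1"
    scala_home="$2"
    worker_mem="$3"
    shift 3

    mkdir -p "$spark_home/conf" || return 1

    # conf/slaves: one host name per line (existing content is replaced).
    printf '%s\n' "$@" > "$spark_home/conf/slaves"

    # conf/spark-env.sh: append the two entries.
    {
        printf 'export SCALA_HOME=%s\n' "$scala_home"
        printf 'export SPARK_WORKER_MEMORY=%s\n' "$worker_mem"
    } >> "$spark_home/conf/spark-env.sh"
}
```

For example, write_spark_conf spark-0.8.0 /path/to/scala 15g node1 node2 would list two slaves and cap each worker at 15 GB, which on a 16 GB machine leaves 1 GB for the OS.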

  • shark-0.8.0
  • hive-0.9.0-shark-0.8.0-bin

Now open the shark-env.sh file and edit the following properties to match our environment –

  • JAVA_HOME
  • HIVE_HOME
  • SCALA_HOME
  • MASTER

The MASTER URL must exactly match the spark:// URI shown at port 8080 of the standalone master. The shark-env.sh file should look like this –

Listing 3 – Shark env file in case of a clustered setup

HADOOP_HOME=/path/to/hadoop

HIVE_HOME=/path/to/hive

MASTER=<Master URI>

SPARK_HOME=/path/to/spark

SPARK_MEM=16g

source $SPARK_HOME/conf/spark-env.sh

The last line reuses the settings already defined in spark-env.sh (such as SCALA_HOME), so they do not have to be duplicated here. Once these parameters are added, make sure to export them with the standard Unix export command. Note that the amount of memory set in SPARK_MEM should not be higher than the SPARK_WORKER_MEMORY value mentioned above. If you want to use Shark with an existing Hive setup, also set the HIVE_CONF_DIR parameter in the shark-env.sh file.
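The memory constraint above is easy to get wrong, so here is a minimal sketch of a check for it. The helper names to_mb and spark_mem_fits are hypothetical and are not part of Spark or Shark; the sketch also assumes the values are written with a g or m suffix, as in 16g or 512m.

```shell
# Hypothetical helper: convert a memory string such as 16g or 512m to megabytes.
to_mb() {
    value="$1"
    number="${value%[gGmM]}"   # strip the unit suffix
    case "$value" in
        *[gG]) echo $((number * 1024)) ;;
        *[mM]) echo "$number" ;;
        *)     return 1 ;;      # unknown or missing unit
    esac
}

# Succeeds only when SPARK_MEM ($1) does not exceed SPARK_WORKER_MEMORY ($2).
spark_mem_fits() {
    mem_mb=$(to_mb "$1") || return 2
    worker_mb=$(to_mb "$2") || return 2
    [ "$mem_mb" -le "$worker_mb" ]
}
```

It could be used before starting the cluster, for instance as spark_mem_fits "$SPARK_MEM" "$SPARK_WORKER_MEMORY" || echo "SPARK_MEM is too large".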

The next step is to copy the Spark and Shark directories to the slaves. Once that is done, start the cluster by running the following command from the Spark directory –

Listing 4 – Launch the Spark cluster

./bin/start-all.sh

The Shark Query Language:

Shark supports a subset of SQL that is very close to the query language implemented by Hive. For example, to create a cached table from the rows of an existing table, set the shark.cache table property as shown below –

Listing 5 – A sample shark query

CREATE TABLE … TBLPROPERTIES ("shark.cache" = "true") AS SELECT …

Shark also extends HiveQL with a shortcut for this syntax: simply append _cached to the table name when using CREATE TABLE AS SELECT, and the table will be cached in memory.
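To make the two forms concrete, the sketch below writes both statements into a script file, using the SOURCE_MAP table from Listing 2 as the source. The file location and the two target table names are hypothetical, chosen only for illustration.

```shell
# Hypothetical script location; adjust as needed.
hql_file="${TMPDIR:-/tmp}/cache_tables.hql"

# Two equivalent ways of creating an in-memory copy of SOURCE_MAP.
cat > "$hql_file" <<'EOF'
-- Explicit form: set the shark.cache table property.
CREATE TABLE source_map_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM SOURCE_MAP;

-- Shortcut form: a table name ending in _cached is kept in memory.
CREATE TABLE source_map_copy_cached AS SELECT * FROM SOURCE_MAP;
EOF
```

The statements in the generated file can then be pasted into the Shark command line interface started with ./bin/shark.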

Summary:

Let us conclude our discussion with the following points –

  • Apache Shark is a distributed query engine developed by the open source community.
  • Apache Shark is a data warehouse system designed to be used with Apache Spark.
  • Apache Shark is compatible with HiveQL and can be easily integrated with Hive.
  • It can run in both standalone mode and clustered mode.