Apache Sqoop is a tool for transferring data to and from the Hadoop Distributed File System (HDFS). Hadoop can process big data and store it in HDFS, but to put that data to use we need a tool that can import and export it efficiently. Apache Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems.
Apache Sqoop is very important when we think about using Hadoop for analytics and data processing. The two main aspects which Sqoop addresses are:
a) Loading bulk (production) data into Hadoop.
b) Accessing bulk data from map/reduce applications running on large clusters.
Earlier, we used to write custom scripts to import and export data between different systems. But this process is inefficient and does not ensure data consistency, accuracy, and other critical properties.
Sqoop uses a straightforward mechanism to transfer data. The dataset is split into slices, and Sqoop runs a map-only job in which each map task is responsible for transferring one slice of the dataset.
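To make the slicing idea concrete, here is a minimal sketch in Python that cuts an integer key range into contiguous slices, one per mapper. It is only an illustration of the concept, not Sqoop's own splitting code, which is more involved. The function name split_ranges and the id column are hypothetical.

```python
def split_ranges(lo, hi, num_slices):
    """Divide the inclusive integer key range [lo, hi] into contiguous slices."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    total = hi - lo + 1
    if total <= 0:
        return []
    # Never create more slices than there are keys.
    num_slices = min(num_slices, total)
    base, extra = divmod(total, num_slices)
    ranges = []
    start = lo
    for i in range(num_slices):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each slice corresponds to the rows one mapper would transfer.
for start, end in split_ranges(1, 1000, 4):
    print(f"slice: id >= {start} AND id <= {end}")
```

With keys 1 to 1000 and four slices, each slice covers 250 consecutive keys, so the four transfers can run side by side without overlapping.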
As discussed, Sqoop can be used to import data from an RDBMS into HDFS. The input to the import process is a database table, which Sqoop reads row by row into HDFS. The import is performed in parallel, so the output is a set of multiple files. These output files can be text files or other types of files containing serialized data.
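A basic table import could be put together as sketched below. The Python helper only assembles the command line and prints it; it does not run Sqoop. The JDBC URL, user name, table, and target directory are made-up placeholders, and build_import_command is a hypothetical helper name. The flags themselves are standard options of the sqoop import tool.

```python
import shlex


def build_import_command(jdbc_url, username, table, target_dir, num_mappers=4):
    """Assemble an argument list for a plain-text table import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                                # prompt for the password on the console
        "--table", table,
        "--target-dir", target_dir,          # HDFS directory that receives the files
        "--num-mappers", str(num_mappers),   # degree of parallelism
        "--as-textfile",                     # write delimited text files
    ]


# All connection details below are placeholders for illustration.
cmd = build_import_command(
    "jdbc:mysql://db.example.com/sales",
    "report_user",
    "orders",
    "/data/sales/orders",
)
print(shlex.join(cmd))
```

Because the import runs with several mappers, the target directory would end up holding several output files rather than one.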
The Sqoop import process has a by-product: a Java class that can encapsulate one row of the imported table. Sqoop itself uses this class during the import process. Its source code is also available for customized use.
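The same Java class can also be generated on its own with the sqoop codegen tool, without running an import. The sketch below assembles such a command in Python. The connection details, the class name com.example.records.Order, and the output directory are assumptions made for illustration, and build_codegen_command is a hypothetical helper.

```python
import shlex


def build_codegen_command(jdbc_url, username, table, class_name, out_dir):
    """Assemble an argument list that generates the record class for one table."""
    return [
        "sqoop", "codegen",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                        # prompt for the password on the console
        "--table", table,
        "--class-name", class_name,  # fully qualified name of the generated class
        "--outdir", out_dir,         # directory that receives the generated .java source
    ]


# Placeholder values; adjust to the actual database and project layout.
cmd = build_codegen_command(
    "jdbc:mysql://db.example.com/sales",
    "report_user",
    "orders",
    "com.example.records.Order",
    "generated-src",
)
print(shlex.join(cmd))
```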
After the imported data has been processed, it can be exported to a relational database using Sqoop. Sqoop reads a set of delimited text files from HDFS (in parallel) and inserts their records as new rows into the target table. The data is then available for consumption by external applications.
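An export in the other direction might look like the following sketch. As before, the Python code only builds and prints the command. The connection details, the table name order_summary, and the HDFS directory are placeholders, and build_export_command is a hypothetical helper. The target table is assumed to exist already in the database.

```python
import shlex


def build_export_command(jdbc_url, username, table, export_dir,
                         field_delim=",", num_mappers=4):
    """Assemble an argument list that exports delimited HDFS files into a table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                                         # prompt for the password
        "--table", table,                             # existing target table
        "--export-dir", export_dir,                   # HDFS directory with the input files
        "--input-fields-terminated-by", field_delim,  # delimiter used in those files
        "--num-mappers", str(num_mappers),            # parallel writers
    ]


# Placeholder values for illustration only.
cmd = build_export_command(
    "jdbc:mysql://db.example.com/reports",
    "report_user",
    "order_summary",
    "/data/sales/order_summary",
)
print(shlex.join(cmd))
```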
Sqoop also provides some command-line utilities to get information about the database it is working with. The list of database schemas and tables can be viewed using Sqoop commands. Sqoop also provides a primitive SQL execution shell.
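These utilities correspond to the list-databases, list-tables, and eval tools. The sketch below builds one command for each. The JDBC URL, user name, and the query against an orders table are assumptions for illustration, and sqoop_tool is a hypothetical helper.

```python
import shlex

# Placeholder connection details, not taken from any real system.
JDBC_URL = "jdbc:mysql://db.example.com/sales"
USERNAME = "report_user"


def sqoop_tool(tool, *extra_args):
    """Assemble an argument list for a Sqoop tool that only needs a connection."""
    return ["sqoop", tool, "--connect", JDBC_URL,
            "--username", USERNAME, "-P", *extra_args]


list_databases = sqoop_tool("list-databases")  # show the available database schemas
list_tables = sqoop_tool("list-tables")        # show the tables in the connected database
run_query = sqoop_tool("eval", "--query", "SELECT COUNT(*) FROM orders")  # run one SQL statement

for cmd in (list_databases, list_tables, run_query):
    print(shlex.join(cmd))
```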
Sqoop operations such as import, export, and code generation can be customized. For import, row ranges and columns can be specified. The delimiters and escape characters for the file-based representation can also be changed as required. The package and class name of the generated code can also be customized to meet the application's requirements.
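As an illustration of these options, the sketch below builds an import that selects a subset of columns, restricts the rows with a condition, and changes the field delimiter and escape character. The column names, the date condition, and the connection details are made up, and build_custom_import is a hypothetical helper.

```python
import shlex


def build_custom_import(jdbc_url, table, target_dir, columns, where,
                        field_delim=r"\t", escape_char="\\"):
    """Assemble an import restricted to some columns and rows, with custom delimiters."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--columns", ",".join(columns),         # import only these columns
        "--where", where,                       # import only rows matching this condition
        "--fields-terminated-by", field_delim,  # backslash-t escape sequence, i.e. a tab
        "--escaped-by", escape_char,            # a single backslash as the escape character
    ]


# Placeholder table, columns, and condition for illustration.
cmd = build_custom_import(
    "jdbc:mysql://db.example.com/sales",
    "orders",
    "/data/sales/orders_2024",
    columns=["id", "customer_id", "total"],
    where="order_date >= '2024-01-01'",
)
print(shlex.join(cmd))
```

The generated class can be renamed in the same way by adding the --class-name or --package-name option to the argument list.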
Sqoop connectors are another important part of the tool. Connectors are plugin components built on Sqoop’s extension framework. They can be added to any Sqoop installation, after which data can be transferred between Hadoop and the external store the connector supports.
Sqoop comes with default connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Sqoop also includes a generic JDBC connector which can be used to connect to any database accessible via JDBC.
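For a database without a dedicated connector, the generic JDBC path is typically selected by naming the driver class with the --driver option. The sketch below shows what such a command might look like. The driver class com.example.jdbc.Driver, the jdbc:exampledb URL, and the table and column names are purely hypothetical, and the driver JAR is assumed to be available to the Sqoop installation.

```python
import shlex


def build_generic_jdbc_import(jdbc_url, driver_class, table, target_dir, split_by):
    """Assemble an import that names the JDBC driver class explicitly."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--driver", driver_class,  # JDBC driver class to load
        "--table", table,
        "--target-dir", target_dir,
        "--split-by", split_by,    # column used to divide the rows among mappers
    ]


# Hypothetical database, driver class, and schema for illustration.
cmd = build_generic_jdbc_import(
    "jdbc:exampledb://db.example.com:5000/inventory",
    "com.example.jdbc.Driver",
    "stock_items",
    "/data/inventory/stock_items",
    "item_id",
)
print(shlex.join(cmd))
```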
To conclude, Sqoop can be used to transfer large datasets between Hadoop and external data stores efficiently. Beyond this, Sqoop also offers many advanced features such as different data formats, compression, customization, and working with queries.
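As a closing illustration of some of these advanced features, the sketch below combines a free-form query import with a binary output format and compression. A free-form query has to contain the $CONDITIONS placeholder, which Sqoop replaces with the range condition for each slice, so the helper checks for it. The query, table names, and paths are assumptions, and build_query_import is a hypothetical helper.

```python
import shlex


def build_query_import(jdbc_url, query, split_by, target_dir,
                       codec="org.apache.hadoop.io.compress.GzipCodec"):
    """Assemble a free-form query import written as compressed SequenceFiles."""
    if "$CONDITIONS" not in query:
        raise ValueError("a free-form query must contain the $CONDITIONS placeholder")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--query", query,
        "--split-by", split_by,        # column used to divide the query result
        "--target-dir", target_dir,
        "--as-sequencefile",           # binary output format instead of text
        "--compress",                  # enable compression
        "--compression-codec", codec,  # Hadoop codec class to use
    ]


# Placeholder query and paths for illustration.
cmd = build_query_import(
    "jdbc:mysql://db.example.com/sales",
    "SELECT o.id, o.total, c.name FROM orders o "
    "JOIN customers c ON o.customer_id = c.id WHERE $CONDITIONS",
    split_by="o.id",
    target_dir="/data/sales/orders_with_customers",
)
# shlex.join single-quotes the query, so the shell leaves $CONDITIONS untouched.
print(shlex.join(cmd))
```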