Home
Blog
Hadoop
Apache Sqoop Tutorial

Apache Sqoop Tutorial

Apache's Sqoop framework is an open-source framework. It's a command-line tool for moving data between relational databases and Hadoop. This Sqoop tutorial will show you how to manage Big Data with Hadoop's Sqoop tool. It covers everything from the fundamentals of Sqoop's architecture to how to use it.

Rating: 4

3847

Get Trained And Certified

Hadoop Sqoop

SQOOP is a tool designed to transfer data between Hadoop and relational databases.
We can use sqoop to import data from a relational database management system such as Mysql or oracle into the HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Also, we can use sqoop to transform the data in Hadoop Map Reduce and then export the data back to RDBMS
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported.
Sqoop uses map-reduce to import and export the data which provides parallel operation as well as fault conditions.
After manipulating the imported records with map-reduce or Hive, you may have a result data set which you can then export back to the relational database
scoop also includes a primitive SQL execution shell.

Interested in mastering MapReduce? Enroll now for a FREE demo on MapReduce training

Direct-mode Imports:

When using direct mode, you can specify additional arguments which should be passed to the underlying tool
If the argument – – is given on the command line, then subsequent arguments are sent directly to the underlying tool
Mysql Provides the Mysql dump tool which can export data from Mysql to other systems very quickly.
By supplying the—direct argument, you are specifying that sqoop should attempt the direct import channel and this channel may be of higher performance than using JDBC

For example, the following adjusts the character set used by Mysql dump.

sqoop import - - connect jdbc: Mysql://server.foo.com/db --table bar-- direct—default-character-set=latin1

Note:-When using the direct mode of import, certain data back client utilities are expected to be present in the shell path of the task process

Mysql – Mysql dump Mysql import

Exports and transactions:-

Export is performed by multiple writers in parallel
Each writer uses a separate connection to the database and these have separate transactions from one another.
Sqoop uses multi-row INSERT Syntax to insert up to 100 records per statement.
For every 100 Statements, the current transaction within a writer’s task is committed causing a commit for every 10,000 rows.
This ensures that transaction buffers do not grow without bond and cause out-of-memory conditions.
Their fore export is an atomic process and partial results of the export will become visible before the export is complete.

Ex:- Sqoop export – connect jdbc: Mysql://db.example.com/foo

         –Table bar–export-dir/results/bar-data

Frequently Asked MapReduce Interview Questions & Answers

Incremental imports:-

Sqoop provides an incremental import mode, which can be used to retrieve only rows newer than some previously – imported set of rows

The following arguments control, incremental imports:

Arguments:–

1) check—column: specifies the column to be examined when determining which rows to import.

2) incremental(mode): specifies how sqoop determines which rows are new and legal values for mode include append and last modified.

3) last-value(value): Specifies the maximum value of the check column from the previous import.

sqoop supports two types of incremental imports: append and last modified and you can use the – incremental argument to specify the type of incremental import to perform.

Ex:- $ sqoop import – connect jdbc:Mysql://db.foo.com/
Some db—table some table--where” id>10000”
--Target-dir/ incremental-dataset -- append

Sqoop internals:-

It describes the internal architecture of sqoop

General program flow
Sub packages
Interfacing with map reduces.
Sqoop program is driven by the com. Cloudera. sqoop. Sqoop

A limited number of additional classes are in the same package

Related Blog: Apache NiFi Tutorial

1. General program flow

Com. Cloudera. sqoop is the main class and implements Tool.
A new instance is launched with tool runner
The first argument to Sqoop is a string identifying the name of a Sqoop tool to run
Sqoop tool itself drives the execution of the user request operation (Ex:-import, export, code gen, etc)
We can set the appropriate fields in the sqoop options class to run its body.
ConnFactory is used to get a connManager from a manager factor.
ConnManager . import Table() method is left to determine how best it is to run the import
Each main action is actually controlled by the conn manager except for the generating of code, which is done by the connection manager and class writer. Both are in the com.cloudera.sqoop.Hive.Hiveimport class after the importable() has computed.
Importing into Hive is also taken care of via the com. Cloudera. Sqoop.Hive. Hive import class after the import table() has been completed.
Import table() method receives a single argument of type import job content which contains parameters to the method
Similarly, the export table() method receives an argument of the type ExportJobContent.

2) Sub packages:-

The following sub-packages under com. Cloudera. Sqoop exists.

1. Hive: Facilitates importing data to Hive.
2. To Implementations of Java. io* interfaces, namely output stream, and writer.
3. Lib: the external public API(described earlier)
4. Manager: The connManager and ManageFactory interface and their implementations.
5. Map-reduce: classes interfacing with the new(0.20+)
6. Orm-code auto-generation and depends on JDK j tools.jar which provides com. sun. tools. package
7. Tool- implementations of sqoop tool
8. Util- miscellaneous utility classes.

3) Interfaces with Map Reduce:-

sqoop schedules Map Reduce jobs to affect imports and exports
Configuration and execution of map-reduce jobs follow a few common steps. They are:

1. Configuring the input format
2. Configuring the output format
3. Setting the mapper implementation

These steps are formalized in the com. Cloudera. Sqoop. Mapreduce. Job Base class
The JobBase allows a user to specify the input format, output format, and mapper to use
The JobBase itself is subclassed by import job base and Job base, which after better support for the particular configuration steps common to import or export-related jobs respectively
ImportjobBase-runImport() will call the configuration steps and run a job to import a table to HDFS

àFor example. DataDrivenImportJob uses the DataDrivenDBInput Format to run an import.

àMysql uses a different class(MysqlDumpImportJob)to run a direct-mode import.

Importing data into Hive:-

We can create Hive schema using sqoop
sqoop import tools main function B to upload your data into files in HDFs
If you have a Hive megastore associated with your HDFS Cluster, sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data layout in Hive.
Importing data into Hive is as simple as adding the –Hive—import option to your sqoop command line
If the Hive table already exists, you can specify the –Hive-over writes option to indicate that the existing table in Hive must be replaced.
After your data is imported into HDFS, Sqoop will generate a Hive savior containing a CREATE TABLE Operation defining your columns using Hive types and LOAD DATA IMPATH statement to move the data into the live warehouse directory.

Ex: # sqoop create-Hive- table—connect jdbc:Mysql:
//local host:3306/test—table sample—fields
-terminated-by’1’
To create a Hive table without data
# sqoop import--connect jdbc: Mysql://local host:3306
/test – table sample-m 1—Hive-import

To create a table in Hive with data

Importing data into HBase:-

Sqoop supports additional import targets beyond HDFS and Hive
Sqoop can also import records in a table in HBase
By specifying –HBase- table, you instruct sqoop to import a table in HBase rather than a directory in HDFS
Sqoop will import data to the table specified as the argument to – h base- table
Each row of the input table will be transformed into an HBase put operation to a row of the output table
The key for each row is taken from a column of the input
You can manually specify the row key column with – h base-row-key.
Each output column will be placed in the same column family, which must be specified with – column-family
We should create the target table and column family before running an import
If you specify – hbase-create- table, sqoop will create the target table and column family if they do not exist,

Ex:# sqoop import—connect jdbc: Mysql://local host:3306
/test—table ORDERS—hbase-create-table –hbase-table ORDERS—Column-family Mysql.

Hadoop Administration	MapReduce
Big Data On AWS	Informatica Big Data Integration
Bigdata Greenplum DBA	Informatica Big Data Edition
Hadoop Hive	Impala
Hadoop Testing	Apache Mahout

On-Job Support Service

Online Work Support for your on-job roles.

@Learner@SME

Our work-support plans provide precise options as per your project tasks. Whether you are a newbie or an experienced professional seeking assistance in completing project tasks, we are here with the following plans to meet your custom needs:

Pay Per Hour
Pay Per Week
Monthly

Learn MoreContact us

Course Schedule

Name	Dates
Hadoop Training	Apr 22 to May 07	View Details
Hadoop Training	Apr 26 to May 11	View Details
Hadoop Training	Apr 29 to May 14	View Details
Hadoop Training	May 03 to May 18	View Details

Last updated: 04 Apr 2023

About Author

Ravindra Savaram

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.

read less

Recommended Courses

Denodo Training

4.6

532

Elasticsearch Training

4.6

824

1 / 15

Hadoop Articles

Hadoop Quiz

Test and Explore your knowledge