How to Insert Data into Tables from Queries in Hadoop

Inserting Data into Tables from Queries

 The INSERT statement is used to load data into a table from a query.

 The example below is for the state of Oregon, where we presume the data is already in another table called staged_employees.

 Let us use different names for the country and state fields in staged_employees, calling them cnty and st.

Ex:- INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR';

 With OVERWRITE, any previous contents of the partition (or of the whole table, if it is not partitioned) are replaced.

 If you use INTO instead of OVERWRITE, Hive appends the data rather than replacing it. INSERT INTO is available in Hive 0.8.0 and later.

 You can mix INSERT OVERWRITE clauses and INSERT INTO clauses as well.
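
 The difference between OVERWRITE and INTO can be sketched outside Hive. The plain-Python toy model below is only an illustration under stated assumptions: the table is a dictionary keyed by partition values, and the helper names insert_overwrite and insert_into are hypothetical, not Hive APIs.

```python
# Toy model of a partitioned table: {(country, state): [rows]}.
# Illustrative only; this is not how Hive stores data.
from collections import defaultdict


def insert_overwrite(table, partition, rows):
    """Replace whatever the partition held with the new rows."""
    table[partition] = list(rows)


def insert_into(table, partition, rows):
    """Append the new rows to the partition's existing contents."""
    table[partition].extend(rows)


employees = defaultdict(list)
insert_into(employees, ("US", "OR"), [("Ann", 50000.0)])
insert_into(employees, ("US", "OR"), [("Bob", 60000.0)])      # appended
insert_overwrite(employees, ("US", "OR"), [("Cy", 70000.0)])  # replaces both
```

 After the three calls, the (US, OR) partition holds only the last row, because the overwrite discarded the two appended rows.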

Dynamic Partition Inserts

 If you have a lot of partitions to create, you have to write a lot of SQL.

 Hive supports a dynamic partition feature, where it can infer the partitions to create based on query parameters.

 By comparison, the partitions named explicitly so far are static partitions. An example of a dynamic partition insert:

Ex:- INSERT OVERWRITE TABLE employees PARTITION (country, state)
SELECT ..., se.cnty, se.st FROM staged_employees se;

 Hive determines the values of the partition keys, country and state, from the last two columns in the SELECT clause.

 This is why we used different names in staged_employees: to emphasize that the relationship between source column values and output partition values is by position only, not by name.

 You can also mix dynamic and static partitions. This variation of the previous query specifies a static value for the country (US) and a dynamic value for the state:

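Ex:- INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state)
SELECT ..., se.st FROM staged_employees se WHERE se.cnty = 'US';

 Only the dynamic key, state, is read from the SELECT clause; the static country value comes from the PARTITION clause.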

 The static partition keys must come before the dynamic partition keys.
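
 To see what inferring partitions by position means, here is a small Python sketch of the fully dynamic case. It is a toy model, not Hive's implementation; the function name dynamic_partitions and the sample rows are made up for illustration.

```python
def dynamic_partitions(rows, partition_keys):
    """Group SELECT output rows by their trailing partition columns.

    Only the number of partition keys matters: the last columns of each
    row are matched to the keys by position, never by column name.
    """
    n = len(partition_keys)
    partitions = {}
    for row in rows:
        data, key = row[:-n], row[-n:]
        partitions.setdefault(key, []).append(data)
    return partitions


# Hypothetical rows shaped like: SELECT name, salary, se.cnty, se.st
staged = [
    ("Ann", 50000.0, "US", "OR"),
    ("Bob", 60000.0, "US", "CA"),
    ("Cy", 70000.0, "CA", "ON"),
    ("Di", 65000.0, "US", "OR"),
]
result = dynamic_partitions(staged, ("country", "state"))
```

 Here result has three partitions, and the two Oregon rows end up together under the key ("US", "OR").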

 Dynamic partitioning is not enabled by default.

 To enable dynamic partitioning, set the desired properties as shown below.

hive> set hive.exec.dynamic.partition=true;

 Set to true to enable dynamic partitioning. The default is false.

hive> set hive.exec.dynamic.partition.mode=nonstrict;

 Set to nonstrict to allow all partitions to be determined dynamically. The default mode is strict.

hive> set hive.exec.max.dynamic.partitions.pernode=1000;

 The maximum number of dynamic partitions that can be created by each mapper or reducer. The default is 100.

hive> set hive.exec.max.created.files;

 The maximum total number of files that can be created globally. A Hadoop counter is used to track the number of files created.

Creating Tables and Loading Them in One Query:-

 You can also create a table and insert query results into it in one statement.

Ex:- CREATE TABLE ca_employees AS SELECT name, salary, address FROM employees se WHERE se.state = 'CA';

 This table contains just the name, salary, and address columns from the employees table records for employees in California.

 The schema for the new table is taken from the SELECT clause.

 This feature is mainly used to extract a convenient subset of data from a larger, more unwieldy table.
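
 The same idea can be tried without a Hive installation. The sketch below uses Python's built-in sqlite3 module as a stand-in, since SQLite also supports CREATE TABLE ... AS SELECT; the employees schema and rows here are simplified assumptions, not the real table behind this article's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema and data.
conn.execute(
    "CREATE TABLE employees (name TEXT, salary REAL, address TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [("Ann", 50000.0, "1 Main St", "CA"),
     ("Bob", 60000.0, "2 Oak Ave", "OR"),
     ("Cy", 70000.0, "3 Pine Rd", "CA")],
)

# Create and load the new table in one statement.
conn.execute(
    "CREATE TABLE ca_employees AS "
    "SELECT name, salary, address FROM employees WHERE state = 'CA'"
)

# The new table's columns come from the SELECT clause.
columns = [row[1] for row in conn.execute("PRAGMA table_info(ca_employees)")]
rows = conn.execute("SELECT * FROM ca_employees ORDER BY name").fetchall()
```

 The columns list contains only name, salary, and address, which shows that the schema was taken from the SELECT clause rather than from the source table.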

Querying Data in Hive

1) SELECT ... FROM Clauses:-

 SELECT is the projection operator in SQL. The FROM clause identifies the table, view, or nested query from which to select records.

Ex:- hive> CREATE TABLE filter_sal_tab AS SELECT empid, emp_name, esal FROM use_tab WHERE esal > 14000;
hive> SELECT * FROM filter_sal_tab;

2) GROUP BY Clauses:-

 The GROUP BY statement is often used in conjunction with aggregate functions to group the result set by one or more columns and then perform an aggregation over each group.

Ex:- hive> SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL' GROUP BY year(ymd);

3) HAVING Clauses:-

 The HAVING clause is used to constrain the groups produced by GROUP BY in a way that could otherwise be expressed with a subquery.

 Here is the previous query with an additional HAVING clause that limits the results to years where the average closing price was greater than $50.0:

Ex:- hive> SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL' GROUP BY year(ymd) HAVING avg(price_close) > 50.0;
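
 GROUP BY and HAVING behave much the same way in most SQL engines, so their effect can be checked with Python's built-in sqlite3 module. This is only a stand-in for Hive: SQLite has no year() function, so strftime('%Y', ...) is used instead, and the prices are made-up numbers rather than real AAPL data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (symbol TEXT, ymd TEXT, price_close REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [("AAPL", "2003-01-02", 10.0), ("AAPL", "2003-06-02", 20.0),
     ("AAPL", "2008-01-02", 100.0), ("AAPL", "2008-06-02", 200.0),
     ("IBM", "2008-01-02", 90.0)],
)

base = (
    "SELECT CAST(strftime('%Y', ymd) AS INTEGER) AS yr, avg(price_close) "
    "FROM stocks WHERE symbol = 'AAPL' GROUP BY yr"
)
# One row per year: the average is computed within each group.
grouped = conn.execute(base + " ORDER BY yr").fetchall()
# HAVING filters whole groups after the aggregation has been computed.
filtered = conn.execute(
    base + " HAVING avg(price_close) > 50.0 ORDER BY yr"
).fetchall()
```

 With this sample data, grouped has one row for 2003 and one for 2008, while filtered keeps only 2008, the single year whose average is above 50.0.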

4) JOINS:-

 Hive supports the classic SQL JOIN statement, but only equi-joins are supported.

Inner Join:

 In an inner join, records are discarded unless the join criteria find matching records in every table being joined.

Ex:- SELECT a.ymd, a.price_close, b.price_close FROM stocks a JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

 The ON clause specifies the conditions for joining records between the two tables.

JOIN Optimizations

 When joining three or more tables, Hive can apply an optimization: if every ON clause uses the same join key, a single MapReduce job will be used.

 Hive also assumes that the last table in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records.

 You should therefore structure join queries so that the largest table is last. In a join between stocks and dividends, the smaller dividends table should come first, so we should switch the positions of stocks and dividends:

Ex:- SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM dividends
d JOIN stocks s ON s.ymd = d.ymd;

 Fortunately, you don't have to put the largest table last in the query. Hive also provides a hint mechanism to tell the query optimizer which table should be streamed.

Ex:- SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';

 Now Hive will attempt to stream the stocks table, even though it's not the last table in the query.

LEFT OUTER JOIN:-

 The left outer join is indicated by adding the LEFT OUTER keywords.

Ex:- hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s LEFT OUTER JOIN dividends d ON
s.ymd = d.ymd AND s.symbol = d.symbol WHERE
s.symbol = 'AAPL';

 In this join, all the records from the left-hand table that match the WHERE clause are returned. If the right-hand table has no record that matches the ON criteria, NULL is used for each column selected from the right-hand table.
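
 The NULL filling can be observed with Python's built-in sqlite3 module, which also supports LEFT OUTER JOIN. This is a stand-in for Hive, and the two small tables below hold hypothetical values chosen so that one stock row has no matching dividend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (ymd TEXT, symbol TEXT, price_close REAL)")
conn.execute("CREATE TABLE dividends (ymd TEXT, symbol TEXT, dividend REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [("2009-01-02", "AAPL", 10.0), ("2009-01-05", "AAPL", 11.0)],
)
conn.execute("INSERT INTO dividends VALUES ('2009-01-05', 'AAPL', 0.5)")

rows = conn.execute(
    """
    SELECT s.ymd, s.symbol, s.price_close, d.dividend
    FROM stocks s LEFT OUTER JOIN dividends d
      ON s.ymd = d.ymd AND s.symbol = d.symbol
    WHERE s.symbol = 'AAPL'
    ORDER BY s.ymd
    """
).fetchall()
```

 Both stock rows come back. The first has None (SQL NULL) in the dividend column because no dividend row matched it.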

RIGHT OUTER JOIN:-

 A right outer join returns all the records in the right-hand table that match the WHERE clause, and NULL is used for fields of missing records in the left-hand table.

Ex:- SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM dividends d RIGHT OUTER JOIN stocks s ON
d.ymd = s.ymd AND d.symbol = s.symbol WHERE
s.symbol = 'AAPL';

LEFT SEMI-JOIN:-

 A left semi-join returns records from the left-hand table if records are found in the right-hand table that satisfy the ON predicates.

 It is a special, optimized case of the more general inner join.

Ex:- hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s LEFT SEMI JOIN dividends d ON
s.ymd = d.ymd AND s.symbol = d.symbol;
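
 The semantics of a left semi-join can be modelled in a few lines of plain Python. This is a hedged sketch rather than Hive's algorithm: the function left_semi_join, the key helper, and the sample rows are all hypothetical.

```python
def left_semi_join(left, right, key):
    """Keep each left row, once, if at least one right row has its key."""
    right_keys = {key(row) for row in right}
    return [row for row in left if key(row) in right_keys]


def by_day_and_symbol(row):
    return (row[0], row[1])  # (ymd, symbol)


stocks = [("2009-01-02", "AAPL", 10.0), ("2009-01-05", "AAPL", 11.0)]
# Two dividend rows share the same key (made-up data).
dividends = [("2009-01-05", "AAPL", 0.5), ("2009-01-05", "AAPL", 0.7)]

semi = left_semi_join(stocks, dividends, by_day_and_symbol)
```

 Only the 2009-01-05 stock row is returned, and it appears once even though two dividend rows match it. Only columns from the left-hand table are available in the result.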

FULL OUTER JOIN:-

 A full outer join returns all records from all the tables that match the WHERE clause. NULL is used for fields of missing records in either table.

Ex:- hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM dividends d FULL OUTER JOIN stocks s ON
d.ymd = s.ymd AND d.symbol = s.symbol
WHERE s.symbol = 'AAPL';
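
 As a rough illustration of how unmatched rows on both sides are kept, here is a plain-Python model of a full outer join. It is a sketch under stated assumptions, not Hive's implementation: full_outer_join and the sample rows are hypothetical, and None plays the role of NULL.

```python
def full_outer_join(left, right, key):
    """Pair rows on key; unmatched rows on either side are paired with None."""
    right_by_key = {}
    for r in right:
        right_by_key.setdefault(key(r), []).append(r)

    joined, matched = [], set()
    for l in left:
        k = key(l)
        if k in right_by_key:
            matched.add(k)
            joined.extend((l, r) for r in right_by_key[k])
        else:
            joined.append((l, None))       # no partner on the right
    for k, rows in right_by_key.items():
        if k not in matched:
            joined.extend((None, r) for r in rows)  # no partner on the left
    return joined


def by_day_and_symbol(row):
    return (row[0], row[1])  # (ymd, symbol)


stocks = [("2009-01-02", "AAPL", 10.0), ("2009-01-05", "AAPL", 11.0)]
dividends = [("2009-01-05", "AAPL", 0.5), ("2009-01-08", "AAPL", 0.6)]
pairs = full_outer_join(stocks, dividends, by_day_and_symbol)
```

 The result contains three pairs: a stock row without a dividend, a matched pair, and a dividend row without a stock row.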

Cartesian Product JOINs:-

 In Hive, the query below computes the full Cartesian product before applying the WHERE clause, so it can take a very long time to finish.

 When the property hive.mapred.mode is set to strict, Hive prevents users from inadvertently issuing a Cartesian product query.

Ex:- hive> SELECT * FROM stocks JOIN dividends
WHERE stocks.symbol = dividends.symbol AND
stocks.symbol = 'AAPL';

Map-side Joins:-

 If all but one of the tables are small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can then do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating the reduce step required in the more common join scenarios.

 On smaller data sets, this optimization is faster than the normal join.

Ex:- SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close,
d.dividend FROM stocks s JOIN dividends d ON
s.ymd = d.ymd AND s.symbol = d.symbol WHERE
s.symbol = 'AAPL';

 The hint still works, but it is deprecated as of Hive 0.7.0.

 However, you still have to set a property to true, as below, before Hive attempts the optimization.

hive> set hive.auto.convert.join=true;
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd
AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';
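
 The idea behind a map-side join, keeping the small table in memory and streaming the large one past it, can be sketched in plain Python. This is an assumption-laden toy, not Hive code: map_side_join and the sample rows are hypothetical, and a dictionary stands in for the in-memory copy of the small table.

```python
def map_side_join(big_rows, small_rows, key):
    """Join by probing an in-memory index of the small table.

    big_rows may be any iterable (for example a generator), so the large
    table is streamed row by row and never held in memory as a whole.
    """
    index = {}
    for row in small_rows:
        index.setdefault(key(row), []).append(row)
    for row in big_rows:
        for match in index.get(key(row), []):
            # Append the non-key columns of the small-table row.
            yield row + match[2:]


def by_day_and_symbol(row):
    return (row[0], row[1])  # (ymd, symbol)


stocks = [("2009-01-02", "AAPL", 10.0), ("2009-01-05", "AAPL", 11.0)]
dividends = [("2009-01-05", "AAPL", 0.5)]
joined = list(map_side_join(iter(stocks), dividends, by_day_and_symbol))
```

 Each streamed row needs only a dictionary lookup, with no shuffle or sort step, which is why no reduce phase is required in this model.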

 Map joins can take advantage of bucketed tables, since a mapper working on a bucket of the left table only needs to load the corresponding buckets of the right table to perform the join.

 However, you need to enable the optimization with the following property:

hive> set hive.optimize.bucketmapjoin=true;

ORDER BY and SORT BY:

 The ORDER BY clause is familiar from other SQL dialects.

 It performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

 Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering.

Example for ORDER BY:-

hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;

Example for SORT BY:-

hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s SORT BY s.ymd ASC, s.symbol DESC;

 The two queries look almost identical, but if more than one reducer is invoked, the output will be sorted differently.

 While each reducer's output files are sorted, the data will probably overlap with the output of other reducers.

 Hive will require a LIMIT clause with ORDER BY if the property hive.mapred.mode is set to strict.
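
 The difference between a total and a local ordering is easy to simulate. The plain-Python sketch below is only a model: the function names order_by and sort_by are hypothetical, the "reducers" are ordinary lists, and dealing rows out round-robin is an assumption made for the example, since Hive decides the real distribution.

```python
def order_by(rows, key):
    """Total ordering: all rows pass through a single 'reducer'."""
    return sorted(rows, key=key)


def sort_by(rows, key, num_reducers):
    """Local ordering: each 'reducer' sorts only the rows it was given."""
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket, key=key) for bucket in buckets]


days = ["2009-01-05", "2009-01-01", "2009-01-02",
        "2009-01-04", "2009-01-03", "2009-01-06"]
total = order_by(days, key=str)
local = sort_by(days, key=str, num_reducers=2)
```

 Each list in local is sorted on its own, but the two date ranges overlap, so concatenating them does not reproduce the total order.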

DISTRIBUTE BY with SORT BY:-

 DISTRIBUTE BY controls how map output is divided among reducers.

 All data that flows through a MapReduce job is organized into key-value pairs.

 Hive must use this feature internally when it converts queries to MapReduce jobs.

 We can use DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer, then use SORT BY to order the data the way we want.

 The following query demonstrates this technique:

hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol SORT BY s.symbol ASC, s.ymd ASC;
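
 A rough Python model shows what this query guarantees. It is a sketch, not Hive's partitioner: distribute_by is a hypothetical helper, zlib.crc32 merely stands in for whatever hash Hive applies, and the stock rows are invented.

```python
import zlib


def distribute_by(rows, key, num_reducers):
    """Route rows so that equal keys always land in the same bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # crc32 stands in for Hive's own hash; unlike Python's built-in
        # hash() for strings, it gives the same value on every run.
        index = zlib.crc32(key(row).encode("utf-8")) % num_reducers
        buckets[index].append(row)
    return buckets


def symbol(row):
    return row[1]


def symbol_then_day(row):
    return (row[1], row[0])


stocks = [
    ("2009-01-05", "IBM", 90.0),
    ("2009-01-02", "AAPL", 10.0),
    ("2009-01-05", "AAPL", 11.0),
    ("2009-01-02", "IBM", 89.0),
    ("2009-01-02", "GE", 15.0),
]
# DISTRIBUTE BY symbol, then SORT BY symbol, ymd inside each bucket.
reducers = [sorted(bucket, key=symbol_then_day)
            for bucket in distribute_by(stocks, symbol, num_reducers=2)]
```

 Whichever bucket a symbol hashes to, all of its rows are found there and nowhere else, and each bucket is ordered by symbol and then by date.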

CLUSTER BY:

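 CLUSTER BY is a shorthand way of expressing DISTRIBUTE BY with SORT BY.

 In the previous example, the s.symbol column was used in the DISTRIBUTE BY clause, and the s.symbol and s.ymd columns in the SORT BY clause. If we drop the s.ymd sort so that the same column is used in both clauses, sorted in ascending order (the default), CLUSTER BY can replace the pair:

Ex:- hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s CLUSTER BY s.symbol;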

UNION ALL:

 UNION ALL combines two or more tables.

 Each subquery of the union must produce the same number of columns, and for each column, its type must match all the column types in the same position.

 For example, if the second column is a FLOAT, then the second column of all the other query results must be a FLOAT.

 The example below shows the merging of log data:

SELECT log.ymd, log.level, log.message
FROM (SELECT l1.ymd, l1.level, l1.message, 'log1' AS source FROM log1 l1 UNION ALL SELECT l2.ymd,
l2.level, l2.message, 'log2' AS source FROM log2 l2) log SORT BY log.ymd ASC;

 UNION ALL may also be used when each clause selects from the same source table.
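
 The log-merging pattern can be tried with Python's built-in sqlite3 module. This is a stand-in for Hive with two assumptions: ORDER BY replaces SORT BY, which SQLite does not have, and the log tables and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("log1", "log2"):
    conn.execute(f"CREATE TABLE {name} (ymd TEXT, level TEXT, message TEXT)")
conn.execute("INSERT INTO log1 VALUES ('2012-01-02', 'WARN', 'disk almost full')")
conn.execute("INSERT INTO log2 VALUES ('2012-01-01', 'INFO', 'service started')")

# Both subqueries produce the same four columns, in the same order.
merged = conn.execute(
    """
    SELECT log.ymd, log.level, log.message, log.source
    FROM (SELECT l1.ymd, l1.level, l1.message, 'log1' AS source FROM log1 l1
          UNION ALL
          SELECT l2.ymd, l2.level, l2.message, 'log2' AS source FROM log2 l2) AS log
    ORDER BY log.ymd ASC
    """
).fetchall()
```

 The extra source column records which table each merged row came from, so the rows stay distinguishable after the union.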

Exporting Data:-

 If the data files are already formatted the way you want, then it is simple enough to copy the directories or files.

Cmd# hadoop fs -cp source_path target_path

 Otherwise, you can use INSERT ... DIRECTORY, as in this example:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca-employees'
SELECT name, salary, address FROM employees se WHERE
se.state = 'CA';

 One or more files will be written to /tmp/ca-employees, depending on the number of reducers invoked.

Last updated: 04 Apr 2023
About Author

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
