Wednesday, November 5, 2014

Performance tuning of Hive queries



Hive performance optimization is a large topic in its own right and is very specific to the queries you are running. In fact, each query in a query file may need separate performance tuning to get the best results.

I'll list a few general approaches used for performance optimization.
Limit the data flow down the queries
In a Hive query, the volume of data that flows down to each level is the factor that decides performance. So if you are executing a script that contains a sequence of HiveQL statements, make sure the data filtration happens in the first few stages rather than carrying unwanted data all the way down. This gives significant performance gains, since the queries further down the lane have far less data to crunch.

This is a common bottleneck when existing SQL jobs are ported to Hive: we simply execute the same sequence of SQL steps in Hive, and that becomes a drag on performance. Understand the requirement or the existing SQL script, and design your Hive job with the data flow in mind.
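For example, a minimal sketch (the sales and customers tables here are hypothetical): filter and project in a subquery before joining, so the later stages operate on a reduced dataset instead of the full tables.

-- Inefficient: join everything first and filter at the very end.
-- Better: push the filter down so the join sees only 2014 rows.
SELECT s.cust_id, c.cust_name, s.amount
FROM (SELECT cust_id, amount FROM sales WHERE year = 2014) s
JOIN customers c ON (s.cust_id = c.cust_id);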

Use Hive merge files
Hive queries are parsed into map-only and map-reduce jobs, and a Hive script will contain lots of queries. Assume one of your queries is parsed into a MapReduce job and the output files from that job are very small, say 10 MB. In that case, the subsequent query that consumes this data may generate a larger number of map tasks and run inefficiently; if more jobs read the same data set, all of them become inefficient. In such scenarios, enabling merge files in Hive makes the first query run a merge job at the end, combining the small files into larger ones. This is controlled using the following parameters:

hive.merge.mapredfiles=true
hive.merge.mapfiles=true (true by default in Hive)

For more control over merge files you can tweak these properties as well
hive.merge.size.per.task (the max final size of a file after the merge task)
hive.merge.smallfiles.avgsize (the merge job is triggered only if the average output file size is less than the specified value)

The default values for the above properties are
hive.merge.size.per.task=256000000
hive.merge.smallfiles.avgsize=16000000

When you enable merging, an extra map-only job is triggered; whether this job gives you an optimization or an overhead depends entirely on your use case and your queries.
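Putting these together, a typical script-level setup could look like the following; the sizes shown just restate the defaults, so tune them to your output file sizes.

-- Merge small output files of both map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of a merged file (~256 MB)
SET hive.merge.size.per.task=256000000;
-- Run the merge only if the average output file is below ~16 MB
SET hive.merge.smallfiles.avgsize=16000000;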

Join Optimizations
Joins are very expensive. Avoid them if possible. If a join is required, try to use join optimizations such as map joins, bucketed map joins, etc.
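For instance, here is a hedged sketch of a map join, assuming a small hypothetical branches lookup table that fits in memory; Hive can broadcast it to every mapper and skip the shuffle of a reduce-side join.

-- Let Hive convert joins against small tables into map joins automatically
SET hive.auto.convert.join=true;

-- Or ask for it explicitly with a hint (used in older Hive versions)
SELECT /*+ MAPJOIN(b) */ s.prid, s.price, b.region
FROM sales_day s
JOIN branches b ON (s.branch = b.branch);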


There is still more to Hive query performance optimization; take this post as the baby step. More will be added to this post soon. :)

Monday, September 9, 2013

HADOOP IMPORTANT LINKS

http://www.techspritz.com/hadoop-single-node-cluster-setup/

http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/

http://hortonworks.com/blog/hadoop-hadoop-hurrah-hdp-for-windows-is-now-ga/

http://bigdatastudio.com/2013/05/19/big-data-jobs/

http://hadoopblog.blogspot.in/2010/05/facebook-has-worlds-largest-hadoop.html?goback=.gde_4244719_member_243018706

http://www.youtube.com/watch?v=A02SRdyoshM

http://jugnu-life.blogspot.com/2012/03/installing-pig-apache-hadoop-pig.html

http://www.aptibook.com/Technical/Hadoop-interview-questions-and-answers?id=2

http://www.pappupass.com/class/index.php/hadoop/hadoop-interview-questions

http://www.rohitmenon.com/index.php/cloudera-certified-hadoop-developer-ccd-410/

http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html

Partitions

Partition: a way to categorize the data in a table.

- Partitions let a query read just the piece of data it needs; by default a table is non-partitioned.

Types: 1. Partitioned
       2. Non-partitioned (the default)
EX: Non-partitioned:
Syntax: create table <table name>(col1 data_type, col2 data_type, ...) row format delimited fields terminated by ',';
Loading: load data local inpath '<local file name>' into table <table name>;
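A concrete version of that syntax, using a hypothetical sales file with comma-separated fields:

hive> create table sales_all(prid int, prname string, quantity int, price double, branch string) row format delimited fields terminated by ',';

hive> load data local inpath 'sales' into table sales_all;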
EX: Partitioned:


Syntax  EX:   hive> create table sales_day(prid int,prname string,quantity int,price double,branch string) partitioned by (day int,month int,year int) row format delimited fields terminated by ',';                          


hive> load data local inpath 'sales' into table sales_day partition(day=12,month=2,year=2013);

hive> load data local inpath 'sales2' into table sales_day partition(day=13,month=2,year=2013);

hive> select * from sales_day;
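The payoff comes when a query restricts the partition columns: Hive reads only the matching partition directories instead of scanning the whole table. For example, this query reads only the day=13/month=2/year=2013 partition:

hive> select prname, quantity, price from sales_day where year=2013 and month=2 and day=13;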

Note:
- In Hive, partitions are logical; in an RDBMS, partitions are physical.
- We use the technique of partitioning to manage incremental loads (each new load goes into its own partition, as in the load statements above).

Managed Tables and External Tables

When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the data into its warehouse directory.
Alternatively, you may create an external table, which tells Hive to refer to the data that is at an existing location outside the warehouse directory.

The difference between the two types of table is seen in the LOAD and DROP semantics.

CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;

CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';

LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
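The DROP side of the contrast works like this: dropping a managed table deletes both the metadata and the data in the warehouse directory, while dropping an external table deletes only the metadata and leaves the files untouched.

DROP TABLE managed_table;    -- metadata and data are both deleted
DROP TABLE external_table;   -- only metadata is deleted; /user/tom/external_table remains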


Which one to use?
As a rule of thumb, if you are doing all your processing with Hive, then use managed tables, but if you wish to use Hive and other tools on the same dataset, then use external tables. A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table. This works the other way around, too—an external table (not necessarily on HDFS) can be used to export data from Hive for other applications to use.
Another reason for using external tables is when you wish to associate multiple schemas with the same dataset.
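A minimal sketch of that common pattern, with hypothetical table names and paths: expose raw files through an external table, then run a Hive transform into a managed table that Hive fully owns.

-- External table over files written to HDFS by another process
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/incoming/logs';

-- Transform into a managed table (CTAS always creates a managed table)
CREATE TABLE clean_logs AS
SELECT line FROM raw_logs WHERE line != '';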


Friday, June 7, 2013

Hive Class 2

Hive data is organized as:

- Databases: namespaces that separate tables
- Tables: homogeneous units of data with the same schema
- Partitions: a way of dividing a table into coarse-grained parts based on the value of a partition column, such as date
- Buckets: tables or partitions may be further subdivided into buckets, to give extra structure to the data that may be used for more efficient queries (a sketch follows after this list)
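A hedged sketch of bucketing, reusing the columns of the sales tables used elsewhere in these notes (the bucketed table name is hypothetical):

create table sales_bucketed(prid int, prname string, quantity int, price double, branch string)
clustered by (prid) into 8 buckets
row format delimited fields terminated by ',';

-- older Hive versions need this so inserts actually honor the bucket definition
set hive.enforce.bucketing=true;
insert overwrite table sales_bucketed select * from sales_all;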

Difference between HIVE and SQL


HIVE class1

What is Hive?
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
- Query execution via MapReduce
- A SQL-like query language, HiveQL, which allows plugging custom mappers/reducers into queries and also allows UDFs (see the TRANSFORM sketch below)
- Easy data summarization, ad-hoc querying, and analysis of large volumes of data
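A hedged sketch of the custom mapper/reducer plug-in point, using TRANSFORM; the script name is hypothetical, must be shipped to the cluster first, and is expected to emit tab-separated fields (raw_logs is the external table sketched earlier).

-- ship a hypothetical streaming script to all nodes
ADD FILE parse_log.py;

-- run each input row through the script as a custom mapper
SELECT TRANSFORM (line)
USING 'python parse_log.py'
AS (ip STRING, url STRING)
FROM raw_logs;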


What Hive is Not?
- Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes). As a result it cannot be compared with systems such as an RDBMS.
- Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs). A table update is achieved by transforming the data into a new table (a sketch follows after this list).
- Hive does not use 'schema on write': it does not verify the data when it is loaded. It uses 'schema on read', verifying the data when a query reads it.
- Unlike an RDBMS table, more than one schema can be applied to the same data.
- Hive doesn't define clear semantics for concurrent access to tables, which means applications need to build their own application-level concurrency or locking mechanisms. The Hive team is actively working on improvements in all these areas.

Hive is Used For
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g., Google Analytics)
- Predictive modeling
- Hypothesis testing


Hive Architecture
- Metastore: stores the system catalog
- Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics
- Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks
- Execution engine: executes the tasks in proper dependency order; interacts with Hadoop
- Hive Server: provides a Thrift interface and JDBC/ODBC for integrating other applications
- Client components: CLI, web interface, JDBC/ODBC interface
- Extensibility interfaces: SerDe, User Defined Functions, and User Defined Aggregate Functions (a sketch follows after this list)
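A sketch of the UDF extensibility point; the jar name and Java class here are hypothetical placeholders, and weblogs is the hypothetical table from the earlier sketch.

-- make a jar containing a custom UDF available to the session
ADD JAR my_udfs.jar;
-- register the Java class as a function callable from HiveQL
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';
SELECT normalize_url(url) FROM weblogs;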

Ecosystem Definitions

Sqoop: used to import data from an RDBMS and also to export data into an RDBMS.
Flume: used to import streaming data.
Oozie: used to schedule (Hadoop) jobs and define workflows.
ZooKeeper: state maintenance and coordination for the cluster, e.g., tracking records and failures and controlling data locks.

MapReduce: MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. The framework is divided into two parts:

Map process: the input is taken by the master node, which divides it into smaller tasks and distributes them to the worker nodes. The worker nodes process these sub-tasks and pass the results back to the master node.

Reduce process: the master node combines all the answers provided by the worker nodes to get the result of the original task. The main advantage of MapReduce is that the map and reduce steps are performed in distributed mode; since each map operation is independent, all maps can run in parallel, reducing the overall computing time.
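To connect this back to Hive: a simple aggregation compiles to exactly this pattern, with the table scan as the map phase and the GROUP BY aggregation as the reduce phase. Using the sales_day table from the partitioning notes above:

-- map phase: scan sales_day and emit (branch, quantity) pairs
-- reduce phase: sum the quantities for each branch key
select branch, sum(quantity) as total_quantity
from sales_day
group by branch;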