Friday, June 7, 2013

HIVE class1

                                          What is Hive?
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
- Query execution via MapReduce
- A SQL-like query language, HiveQL, which lets you plug custom mappers and reducers into queries and also supports user-defined functions (UDFs)
- Easy data summarization, ad-hoc querying, and analysis of large volumes of data
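
To make "query execution via MapReduce" a little more concrete, here is a minimal sketch in plain Python (not Hive code) of how a query like SELECT page, COUNT(*) FROM logs GROUP BY page could be split into a map step, a shuffle, and a reduce step. The log format, the sample records, and all function names are assumptions made up for illustration; the plans Hive actually generates are more involved.

```python
from collections import defaultdict

# Hypothetical input: one web-log record per line, "user<TAB>page".
LOG_LINES = [
    "alice\t/home",
    "bob\t/home",
    "alice\t/cart",
    "carol\t/home",
]

def map_phase(line):
    """Emit (page, 1) for each record, like the map side of a GROUP BY."""
    _user, page = line.split("\t")
    yield page, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate all values for one key, like COUNT(*)."""
    return key, sum(values)

def run_query(lines):
    """Chain map, shuffle, and reduce over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

page_counts = run_query(LOG_LINES)
print(page_counts)  # {'/home': 3, '/cart': 1}
```

On a real cluster the map and reduce steps run in parallel on many machines; the sketch only shows how the work is divided.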


                              What Hive is Not
- Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes). As a result, Hive is not a substitute for an RDBMS where fast response times are needed.
- Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (such as web logs). A table is updated by transforming its data into a new table.
- Hive does not use ‘schema on write’: it does not verify data when it is loaded. It uses ‘schema on read’, checking the data against the schema when a query runs.
- Unlike an RDBMS table, more than one schema can be applied to the same data.
- Hive doesn’t define clear semantics for concurrent access to tables, which means applications need to build their own application-level concurrency or locking mechanism. The Hive team is actively working on improvements in all these areas.
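
The ‘schema on read’ idea can be sketched in a few lines of plain Python (again, not Hive code). The raw lines, the column names, and the helper functions below are hypothetical. Turning an unparseable value into None is an assumption that mirrors how Hive is generally understood to return NULL for a field that does not fit the declared column type.

```python
# Hypothetical raw file contents; nothing is validated when these lines are "loaded".
RAW_LINES = [
    "1,alice,42",
    "2,bob,not_a_number",
]

def parse_int(text):
    """Return an int, or None when the text does not fit the declared type."""
    try:
        return int(text)
    except ValueError:
        return None

# Two schemas over the same bytes: (column name, converter) pairs.
SCHEMA_TYPED = [("id", parse_int), ("name", str), ("score", parse_int)]
SCHEMA_STRINGS = [("c1", str), ("c2", str), ("c3", str)]

def read_rows(lines, schema, delimiter=","):
    """Apply a schema while reading; the stored lines are never modified."""
    for line in lines:
        fields = line.split(delimiter)
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}

typed_rows = list(read_rows(RAW_LINES, SCHEMA_TYPED))
string_rows = list(read_rows(RAW_LINES, SCHEMA_STRINGS))

for row in typed_rows:
    print(row)
```

The same two lines are read once as typed columns and once as plain strings, which is the sense in which more than one schema can be applied to the same data. The bad score only shows up when the typed schema is applied at read time.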

                                       Hive Used For
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g., Google Analytics)
- Predictive modeling
- Hypothesis testing


Hive Architecture
- Metastore: stores the system catalog
- Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics
- Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks
- Execution engine: executes the tasks in proper dependency order; interacts with Hadoop
- Hive Server: provides a Thrift interface and JDBC/ODBC for integrating other applications
- Client components: CLI, web interface, JDBC/ODBC interface
- Extensibility interfaces: SerDe, user-defined functions (UDFs), and user-defined aggregate functions (UDAFs)
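
As a rough illustration of what "executes the tasks in proper dependency order" means, the sketch below runs a small directed acyclic graph of tasks using Python's standard graphlib module (Python 3.9 or later). The task names and the plan are invented for the example and do not come from a real Hive query plan. The execute helper is likewise hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}

def execute(plan, run_task):
    """Run every task only after all of its dependencies have finished."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    finished = []
    while sorter.is_active():
        for task in sorter.get_ready():
            run_task(task)
            finished.append(task)
            sorter.done(task)
    return finished

order = execute(PLAN, run_task=lambda task: print("running", task))
```

Tasks that are ready at the same time (here the two scans) have no dependency on each other, so a real engine could submit them to the cluster in parallel.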
