HADOOP ONLINE TRAINING CALL 09640156134: HIVE class1

What is Hive?

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:

• Tools to enable easy data extract/transform/load (ETL)

• A mechanism to impose structure on a variety of data formats

• Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase

• Query execution via MapReduce

• A SQL-like query language, HiveQL, which also lets you plug custom mappers/reducers into queries and supports user-defined functions (UDFs)

Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data.
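To make "query execution via MapReduce" more concrete, here is a minimal Python sketch of how a grouped count (roughly `SELECT status, COUNT(*) FROM logs GROUP BY status`) can be split into map, shuffle, and reduce steps. It is only an in-memory illustration, not Hive's actual execution code; the sample rows and the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `count_by_status` are made up for this example.

```python
from collections import defaultdict

# Hypothetical web-log rows as (url, http_status); not taken from any real data set.
LOG_ROWS = [
    ("/home", 200),
    ("/cart", 500),
    ("/home", 200),
    ("/help", 404),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _url, status in rows:
        yield status, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_status(rows):
    """Run the three phases in order for the grouped count."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_status(LOG_ROWS))
```

In a real cluster the map and reduce steps run as distributed tasks over files in HDFS; the sketch only shows the shape of the computation.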

What Hive Is Not

• Hive is not a low-latency system. Query latency is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes). As a result, it cannot be compared with systems such as an RDBMS on response time.

• Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (such as web logs). A table is updated by transforming its data into a new table.

• Hive is not ‘schema on write’: it does not verify data when it is loaded. It is ‘schema on read’: the schema is applied to the data when a query runs.

• Unlike an RDBMS table, more than one schema can be applied to the same data.

• Hive doesn’t define clear semantics for concurrent access to tables, which means applications need to build their own application-level concurrency or locking mechanism. The Hive team is actively working on improvements in all these areas.
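The ‘schema on read’ idea, and the point that more than one schema can be applied to the same data, can be sketched in a few lines of Python. This is an illustration only, not Hive code: the raw lines, the column names, and the helpers `to_int` and `read_with_schema` are assumptions made for the example. Turning an unparseable field into `None` mirrors the NULL a typed column typically yields in Hive when a value does not fit the declared type.

```python
# Hypothetical raw file contents; loading them performs no validation at all.
RAW_LINES = [
    "1,alice,2013-06-07",
    "2,bob,2013-06-08",
    "x,carol,2013-06-09",  # first field is not a number
]


def to_int(text):
    """Convert text to int, returning None when the value does not fit the type."""
    try:
        return int(text)
    except ValueError:
        return None


def read_with_schema(lines, schema):
    """Apply a schema, given as (column_name, converter) pairs, at read time."""
    rows = []
    for line in lines:
        fields = line.split(",")
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, fields)})
    return rows


# Two different schemas over the same stored data.
SCHEMA_TYPED = [("id", to_int), ("name", str), ("joined", str)]
SCHEMA_STRINGS = [("c1", str), ("c2", str), ("c3", str)]

if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SCHEMA_TYPED):
        print(row)
    for row in read_with_schema(RAW_LINES, SCHEMA_STRINGS):
        print(row)
```

The bad value in the third line goes unnoticed at load time and only surfaces when the typed schema is applied during a read.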

Hive Used For

• Log processing

• Text mining

• Document indexing

• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling

• Hypothesis testing

Hive Architecture

• Metastore: stores the system catalog

• Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics

• Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks

• Execution engine: executes the tasks in proper dependency order and interacts with Hadoop

• Hive Server: provides a Thrift interface and JDBC/ODBC for integrating other applications

• Client components: CLI, web interface, JDBC/ODBC interface

• Extensibility interfaces: SerDe, user-defined functions (UDFs), and user-defined aggregate functions (UDAFs)
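The compiler and execution engine roles above can be pictured with a small Python sketch: a query plan is represented as a directed acyclic graph of tasks, and the tasks are run so that each one starts only after the tasks it depends on. This is a toy model, not Hive's implementation; the task names in `PLAN` and the function `run_plan` are hypothetical, and the ordering relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical compiled plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "aggregate": {"join"},
}


def run_plan(plan):
    """Run every task after its dependencies and return the execution order."""
    executed = []
    for task in TopologicalSorter(plan).static_order():
        # A real engine would submit a map/reduce job to Hadoop here.
        executed.append(task)
    return executed


if __name__ == "__main__":
    print(run_plan(PLAN))
```

The two scan tasks have no dependency on each other, so their relative order is not fixed; only the constraint that `join` follows both scans and `aggregate` follows `join` is guaranteed.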

Friday, June 7, 2013
