What is Hive?
The Apache Hive data warehouse software facilitates querying and managing large
datasets residing in distributed storage. Built on top of Apache Hadoop, it
provides:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS or in other data
  storage systems such as Apache HBase
- Query execution via MapReduce
- A SQL-like query language, HiveQL, which also allows custom mappers and
  reducers to be plugged into queries and supports user-defined functions (UDFs)
Hive is designed to enable easy data summarization, ad-hoc querying, and
analysis of large volumes of data.
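To make "query execution via MapReduce" concrete, here is a minimal Python sketch of how a grouped count (roughly `SELECT page, COUNT(*) ... GROUP BY page`) can be expressed as a map phase, a shuffle, and a reduce phase. It illustrates the idea only and is not Hive's actual implementation; the sample rows and the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_by_page`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical web-log rows: (page, user). Not taken from the source.
LOG_ROWS = [
    ("/home", "alice"),
    ("/docs", "bob"),
    ("/home", "carol"),
    ("/home", "alice"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for page, _user in rows:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Chain the three phases into one grouped count."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(LOG_ROWS))
```

On a real cluster the map and reduce phases run in parallel across many machines; the sketch runs them in a single process to show only the data flow.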
What Hive Is Not
- Latency for Hive queries is generally very high (minutes), even when the data
  sets involved are very small (say, a few hundred megabytes). As a result, Hive
  cannot be compared with systems such as an RDBMS.
- Hive is not designed for online transaction processing and does not offer
  real-time queries or row-level updates. It is best used for batch jobs over
  large sets of immutable data (like web logs). A table update is achieved by
  transforming the data into a new table.
- The Hive schema is not "schema on write": Hive does not verify data when it
  is loaded. It is "schema on read": data is checked against the schema only
  when a query reads it.
- Unlike an RDBMS table, more than one schema can be applied to the same data.
- Hive doesn't define clear semantics for concurrent access to tables, which
  means applications need to build their own application-level concurrency or
  locking mechanism. The Hive team is actively working on improvements in all
  these areas.
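The "schema on read" point, and the fact that several schemas can describe the same data, can be illustrated with a small Python sketch. It is a simplified model and not Hive code: the raw lines, the two schemas, and `read_with_schema` are hypothetical, and treating a field that does not fit as `None` (standing in for NULL) is an assumption made for the example.

```python
# Raw file contents as loaded: nothing is validated at load time.
RAW_LINES = [
    "2014-01-05,/home,200",
    "2014-01-05,/docs,404",
    "not-a-valid-row",
]

# Two hypothetical schemas over the same raw data: (column name, converter).
SCHEMA_FULL = [("day", str), ("page", str), ("status", int)]
SCHEMA_TEXT = [("line", str)]


def read_with_schema(lines, schema, delimiter=","):
    """Apply a schema while reading; fields that do not fit become None."""
    for line in lines:
        # Split into at most len(schema) fields.
        parts = line.split(delimiter, len(schema) - 1)
        row = {}
        for i, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(parts[i])
            except (IndexError, ValueError):
                # Assumption: a missing or malformed field reads as NULL.
                row[name] = None
        yield row


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SCHEMA_FULL):
        print(row)
    for row in read_with_schema(RAW_LINES, SCHEMA_TEXT):
        print(row)
```

The malformed third line is accepted at load time and only shows up as missing fields when it is read through `SCHEMA_FULL`; read through `SCHEMA_TEXT`, the same line is an ordinary value.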
What Hive Is Used For
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g., Google Analytics)
- Predictive modeling
- Hypothesis testing
Hive Architecture
- Metastore: stores the system catalog
- Driver: manages the life cycle of a HiveQL query as it moves through Hive;
  also manages the session handle and session statistics
- Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce
  tasks
- Execution engine: executes the tasks in proper dependency order; interacts
  with Hadoop
- Hive Server: provides a Thrift interface and JDBC/ODBC for integrating other
  applications
- Client components: CLI, web interface, JDBC/ODBC interface
- Extensibility interfaces: include SerDe, User Defined Functions (UDFs), and
  User Defined Aggregate Functions (UDAFs)
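The compiler and execution engine roles described above (a directed acyclic graph of tasks, run in dependency order) can be sketched in Python with the standard-library `graphlib` module (Python 3.9+). This is an illustration of dependency-ordered execution in general, not Hive's scheduler; the task names in `PLAN` and the `run_plan` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}


def run_plan(plan, run_task):
    """Run every task after all of its dependencies; return the order used."""
    order = []
    for task in TopologicalSorter(plan).static_order():
        run_task(task)
        order.append(task)
    return order


if __name__ == "__main__":
    executed = run_plan(PLAN, lambda task: print("running", task))
    print(executed)
```

The two scan tasks have no dependencies on each other, so their relative order is not fixed; a real engine could run such independent tasks in parallel.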