Friday, June 7, 2013

Hive Class 2

Hive Data Is Organized As

- Databases: Namespaces that separate tables.
- Tables: Homogeneous units of data that share the same schema.
- Partitions: A way of dividing a table into coarse-grained parts based on the value of a partition column, such as a date.
- Buckets: Tables or partitions may be further subdivided into buckets, which give the data extra structure that can be used for more efficient queries.
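
To make partitions and buckets concrete, here is a minimal Python sketch of the idea, not Hive's actual storage code. The column names (`dt`, `user_id`), the bucket count, and the use of CRC32 as the hash are all assumptions chosen for illustration; Hive uses its own hash function.

```python
from collections import defaultdict
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Stable hash of the bucketing column value, modulo the bucket count.
    # CRC32 is an assumption here; Hive uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def organize(rows, partition_col, bucket_col):
    # layout[partition_value][bucket_id] -> list of rows
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        layout[row[partition_col]][bucket_for(row[bucket_col])].append(row)
    return layout


def scan_partition(layout, partition_value):
    # A query that filters on the partition column only touches one partition.
    buckets = layout.get(partition_value, {})
    return [row for bucket_id in sorted(buckets) for row in buckets[bucket_id]]


rows = [
    {"dt": "2013-06-06", "user_id": 1, "url": "/home"},
    {"dt": "2013-06-06", "user_id": 2, "url": "/cart"},
    {"dt": "2013-06-07", "user_id": 1, "url": "/home"},
    {"dt": "2013-06-07", "user_id": 1, "url": "/pay"},
]

layout = organize(rows, partition_col="dt", bucket_col="user_id")
print(sorted(layout))                              # partition values
print(len(scan_partition(layout, "2013-06-07")))   # rows read for one date
```

Within a partition, all rows with the same `user_id` land in the same bucket, which is the extra structure that can make sampling and some joins cheaper.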

Difference Between Hive and SQL


Hive Class 1

What Is Hive?
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:
- Tools to enable easy data extract/transform/load (ETL).
- A mechanism to impose structure on a variety of data formats.
- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
- Query execution via MapReduce.
- A SQL-like query language, HiveQL, which also lets you plug custom mappers and reducers into queries and supports user-defined functions (UDFs).

Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data.


What Hive Is Not
- Hive is not a low-latency system. Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes). As a result, it cannot be compared with systems such as an RDBMS.
- Hive is not designed for online transaction processing, and it does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (such as web logs). A table is updated by transforming its data into a new table.
- Hive is not "schema on write": it does not verify data when the data is loaded. It is "schema on read": data is checked against the schema when a query runs.
- Unlike an RDBMS table, the same data can have more than one schema applied to it.
- Hive does not define clear semantics for concurrent access to tables, so applications need to build their own application-level concurrency or locking mechanism. The Hive team is actively working on improvements in all these areas.
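
The "schema on read" point, and the fact that more than one schema can be applied to the same data, can be sketched in a few lines of Python. This only illustrates the idea and is not how Hive's SerDe layer is implemented. The sample lines, column names, and helper functions are hypothetical. A value that does not fit its declared type becomes `None`, mirroring the way Hive returns NULL for such fields at query time.

```python
# Raw tab-delimited lines, stored as-is with no validation at load time.
RAW_LINES = [
    "1\talice\t2013-06-07",
    "2\tbob\tnot-a-date",
    "x\tcarol\t2013-06-08",
]


def to_int(text):
    # A value that does not parse becomes None instead of failing the load.
    try:
        return int(text)
    except ValueError:
        return None


def read_with_schema(lines, schema):
    # schema is a list of (column_name, converter) pairs applied at read time.
    for line in lines:
        fields = line.split("\t")
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}


# Two different schemas over the same stored bytes.
users_schema = [("id", to_int), ("name", str), ("signup", str)]
names_only_schema = [("raw_id", str), ("name", str)]

typed = list(read_with_schema(RAW_LINES, users_schema))
loose = list(read_with_schema(RAW_LINES, names_only_schema))

print(typed[2])  # the bad id only shows up when the data is read
print(loose[2])  # the same line read under a looser schema
```

Nothing rejects the third line when it is stored. The mismatch only surfaces when a schema that expects an integer `id` is applied during the read.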

Hive Used For
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (for example, Google Analytics)
- Predictive modeling
- Hypothesis testing


Hive Architecture
- Metastore: Stores the system catalog.
- Driver: Manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
- Query compiler: Compiles HiveQL into a directed acyclic graph (DAG) of map/reduce tasks.
- Execution engine: Executes the tasks in proper dependency order and interacts with Hadoop.
- Hive Server: Provides a Thrift interface and JDBC/ODBC for integrating other applications.
- Client components: CLI, web interface, and JDBC/ODBC interface.
- Extensibility interfaces: SerDe, user-defined functions (UDFs), and user-defined aggregate functions (UDAFs).
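
The compiler and execution engine entries describe a DAG of tasks that must run in dependency order. The Python sketch below shows one way such an ordering could be computed (a topological sort using Kahn's algorithm). The task names and the plan are hypothetical and are not Hive's real operator or task names.

```python
from collections import deque


def execution_order(dependencies):
    # dependencies maps each task to the set of tasks it depends on.
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    # Tasks with no unmet dependencies can run immediately.
    ready = deque(sorted(task for task, deps in remaining.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Finishing a task may unblock the tasks that were waiting on it.
        for other in sorted(remaining):
            deps = remaining[other]
            if task in deps:
                deps.remove(task)
                if not deps:
                    ready.append(other)

    if len(order) != len(remaining):
        raise ValueError("cycle detected; the plan is not a DAG")
    return order


# Hypothetical plan for a join followed by an aggregation.
plan = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "group_by": {"join"},
    "write_output": {"group_by"},
}

order = execution_order(plan)
print(order)
```

The two scans have no dependencies, so they are free to run first (and could run in parallel); every later task waits until all of its inputs have finished.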

Ecosystem Definitions

Sqoop: Imports data from an RDBMS and exports data to an RDBMS.
Flume: Imports streaming data.
Oozie: Schedules Hadoop jobs and defines workflows.
ZooKeeper: Maintains state. It stores records and failures of records, and controls data locks.

MapReduce: MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
The framework is divided into two parts:
Map process:
The master node takes the input, divides it into smaller sub-tasks, and distributes them to the worker nodes. The worker nodes process these sub-tasks and pass the results back to the master node.
Reduce process:
The master node combines all the answers provided by the worker nodes to produce the result of the original task.
The main advantage of MapReduce is that the map and reduce steps are performed in distributed mode. Because each map operation is independent, the maps can be performed in parallel, which reduces the total computing time.
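
The map and reduce steps described above can be sketched as a small word count in Python. This is a single-machine illustration under stated assumptions, not Hadoop's API: a thread pool stands in for the worker nodes, and the function names and sample text are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    # Map: each worker turns its chunk into (key, value) pairs independently.
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    # Group all values by key so each key can be reduced on its own.
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    # Reduce: combine the values for one key into a single answer.
    return key, sum(values)


def word_count(chunks, workers=3):
    # The thread pool stands in for worker nodes; map calls do not share state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


chunks = [
    "hive runs on hadoop",
    "hadoop runs map reduce",
    "hive compiles queries",
]

counts = word_count(chunks)
print(counts["hive"], counts["hadoop"])
```

Because `map_words` depends only on its own chunk, the chunks can be processed in any order or at the same time, which is the property that lets a real cluster spread the work across machines.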