HADOOP ONLINE TRAINING CALL FOR 09640156134: June 2013

Hive Class 2

Hive Data Is Organized As

- Databases: Namespaces that separate tables.

- Tables: Homogeneous units of data that share the same schema.

- Partitions: A way of dividing a table into coarse-grained parts based on the value of a partition column, such as date.

- Buckets: Tables or partitions may be further subdivided into buckets, which give the data extra structure that can be used for more efficient queries.
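
To make the partition and bucket ideas concrete, here is a minimal Python sketch. It is not Hive's actual implementation. It assumes a hypothetical `page_views` table partitioned by a `dt` column and bucketed by `user_id` into 4 buckets. The `column=value` directory naming follows Hive's partition layout convention, but the hash (the integer key modulo the bucket count) and the `bucket_N` names are simplified stand-ins chosen for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, for illustration only


def partition_dir(table, row, partition_col):
    """Coarse-grained split: one directory per partition column value."""
    return f"{table}/{partition_col}={row[partition_col]}"


def bucket_id(row, bucket_col, num_buckets=NUM_BUCKETS):
    """Finer split: stand-in hash (the integer key itself) modulo bucket count."""
    return row[bucket_col] % num_buckets


def organize(table, rows, partition_col, bucket_col):
    """Group rows by partition directory, then by bucket within it."""
    layout = defaultdict(list)
    for row in rows:
        path = (
            f"{partition_dir(table, row, partition_col)}"
            f"/bucket_{bucket_id(row, bucket_col)}"
        )
        layout[path].append(row)
    return dict(layout)


# Hypothetical sample rows
rows = [
    {"user_id": 1, "dt": "2013-06-07", "url": "/home"},
    {"user_id": 5, "dt": "2013-06-07", "url": "/cart"},
    {"user_id": 2, "dt": "2013-06-08", "url": "/home"},
]

layout = organize("page_views", rows, "dt", "user_id")
for path in sorted(layout):
    print(path, [r["user_id"] for r in layout[path]])
```

A query that filters on `dt` only needs to look inside the matching partition directory, and a query that filters or joins on `user_id` can go straight to one bucket. That is the sense in which partitions and buckets make queries more efficient.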

Difference Between Hive and SQL

Hive Class 1

What Is Hive?

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides

- Tools to enable easy data extract/transform/load (ETL)

- A mechanism to impose structure on a variety of data formats

- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase

- Query execution via MapReduce

- A SQL-like query language, HiveQL, which allows custom mappers and reducers to be plugged into queries and also supports user-defined functions (UDFs)

Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data.

What Hive Is Not

- Latency for Hive queries is generally very high (on the order of minutes), even when the data sets involved are very small (say, a few hundred megabytes). As a result, Hive cannot be compared with low-latency systems such as an RDBMS.

- Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (such as web logs). A table is updated by transforming its data into a new table.

- Hive does not use 'schema on write': it does not verify data when it is loaded. Instead it uses 'schema on read', verifying data only when a query reads it.

- Unlike an RDBMS table, more than one schema can be applied to the same data.

- Hive does not define clear semantics for concurrent access to tables, so applications need to build their own application-level concurrency or locking mechanism. The Hive team is actively working on improvements in all these areas.
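
The 'schema on read' idea, and the point that more than one schema can be applied to the same data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hive is implemented: the sample lines, the schema names, and the helper functions are all hypothetical. Fields that cannot be read under a schema become `None` here, which is meant to mimic a query returning NULL for a malformed value.

```python
# Hypothetical raw data; the last two lines are deliberately malformed.
RAW_LINES = ["1,alice,2013-06-07", "2,bob,not-a-date", "oops"]


def load(lines):
    """Schema on read: loading only stores the raw lines, with no validation."""
    return list(lines)


def read_with_schema(stored, schema):
    """Apply (name, converter) pairs at query time; bad fields become None."""
    for line in stored:
        fields = line.split(",")
        row = {}
        for i, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(fields[i])
            except (IndexError, ValueError):
                row[name] = None  # missing or unparsable field
        yield row


stored = load(RAW_LINES)  # succeeds even though some lines are malformed

# Two different schemas applied to the same stored data
schema_a = [("id", int), ("name", str)]
schema_b = [("id", str), ("name", str), ("day", str)]

rows_a = list(read_with_schema(stored, schema_a))
rows_b = list(read_with_schema(stored, schema_b))

for row in rows_a:
    print(row)
for row in rows_b:
    print(row)
```

Note that `load` never fails, because nothing is checked at load time. Problems with the data only show up when a schema is applied during a read, and each schema may see a different set of problems.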

Hive Used For

- Log processing

- Text mining

- Document indexing

- Customer-facing business intelligence (for example, Google Analytics)

- Predictive modeling

- Hypothesis testing

Hive Architecture

- Metastore: Stores the system catalog.

- Driver: Manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.

- Query compiler: Compiles HiveQL into a directed acyclic graph of map/reduce tasks.

- Execution engine: Executes the tasks in proper dependency order and interacts with Hadoop.

- Hive Server: Provides a Thrift interface and JDBC/ODBC for integrating other applications.

- Client components: CLI, web interface, and JDBC/ODBC interface.

- Extensibility interfaces: SerDe, user-defined functions (UDFs), and user-defined aggregate functions (UDAFs).
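
The compiler and execution engine entries describe a directed acyclic graph of tasks that is run in dependency order. A rough Python sketch of that idea follows. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) as a stand-in for the execution engine. The task names and the plan itself are made up for illustration and do not come from a real Hive query plan.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "aggregate": {"join"},
}


def run_in_dependency_order(plan):
    """Run each task only after all tasks it depends on have finished."""
    executed = []
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # also raises CycleError if the plan is not acyclic
    while sorter.is_active():
        # Every task returned together is independent of the others,
        # so a real engine could submit them in parallel.
        ready = sorted(sorter.get_ready())  # sorted for a repeatable order
        for task in ready:
            executed.append(task)  # stand-in for submitting a job to Hadoop
            sorter.done(task)
    return executed


order = run_in_dependency_order(plan)
print(order)
```

In this sketch the two scans have no dependencies, so they become ready first. The join runs only after both scans are done, and the aggregate runs last.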

Ecosystem Definitions

Sqoop: Imports data from an RDBMS into Hadoop and exports data back into an RDBMS.

Flume: Imports streaming data.

Oozie: Schedules Hadoop jobs and defines workflows.

ZooKeeper: Maintains state. It stores state records, tracks record failures, and controls data locks.

MapReduce: A software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.

The framework is divided into two parts:

Map process:
The master node takes the input, divides it into smaller sub-tasks, and distributes them to the worker nodes. The worker nodes process these sub-tasks and pass the results back to the master node.

Reduce process:
The master node combines the answers provided by the worker nodes to produce the result of the original task. The main advantage of MapReduce is that the map and reduce steps run in distributed mode. Because each map operation is independent, the maps can be performed in parallel, which reduces the total computing time.
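
The map and reduce steps described above can be sketched with a small word count in Python. This is a single-process illustration, not Hadoop code. The input chunks are made up, and the `shuffle` step (grouping values by key between map and reduce) is an added assumption that the text above does not name. In a real cluster, each `map_phase` call could run on a different worker node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one input chunk into (word, 1) pairs. Chunks are independent."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped):
    """Group all values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input, already divided into smaller sub-tasks
chunks = [
    "hive runs on hadoop",
    "hadoop runs map reduce",
    "hive compiles to map reduce",
]

mapped = [map_phase(chunk) for chunk in chunks]  # could run in parallel
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

for word in sorted(counts):
    print(word, counts[word])
```

Because no `map_phase` call depends on another, the list comprehension that builds `mapped` could be replaced by parallel workers without changing the result.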

Friday, June 7, 2013
