Monday, September 9, 2013

HADOOP IMPORTANT LINKS

http://www.techspritz.com/hadoop-single-node-cluster-setup/

http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/

http://hortonworks.com/blog/hadoop-hadoop-hurrah-hdp-for-windows-is-now-ga/

http://bigdatastudio.com/2013/05/19/big-data-jobs/

http://hadoopblog.blogspot.in/2010/05/facebook-has-worlds-largest-hadoop.html?goback=.gde_4244719_member_243018706

http://www.youtube.com/watch?v=A02SRdyoshM

http://jugnu-life.blogspot.com/2012/03/installing-pig-apache-hadoop-pig.html

http://www.aptibook.com/Technical/Hadoop-interview-questions-and-answers?id=2

http://www.pappupass.com/class/index.php/hadoop/hadoop-interview-questions

http://www.rohitmenon.com/index.php/cloudera-certified-hadoop-developer-ccd-410/

http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html

Partitions

Partition: a means to categorize the data in a table.
• Whenever we request a piece of data, partitions let Hive read only the relevant part of the table; by default a table is non-partitioned.
Types:  1. Partitioned
              2. Non-Partitioned (the default)
EX: Non-Partitioned:
 Syntax: create table <table name>(col1 data type, col2 data type, …) row format delimited
                                  fields terminated by ','
Loading:  load data local inpath '<local file name>' into table <table name>;
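For example, a minimal sketch of a non-partitioned table (the table name sales and the file name sales.csv are illustrative):

hive> create table sales(prid int, prname string, quantity int, price double, branch string) row format delimited fields terminated by ',';
hive> load data local inpath 'sales.csv' into table sales;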
EX: Partitioned:

Syntax EX:  hive> create table sales_day(prid int, prname string, quantity int, price double, branch string) partitioned by (day int, month int, year int) row format delimited fields terminated by ',';


hive> load data local inpath 'sales' into table sales_day partition(day=12, month=2, year=2013);

hive> load data local inpath 'sales2' into table sales_day partition(day=13, month=2, year=2013);

hive> select * from sales_day;
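Once the table is partitioned, a query can restrict itself to a single partition; a minimal sketch against the sales_day table above (Hive then reads only that partition's directory, not the whole table):

hive> select prname, price from sales_day where day=12 and month=2 and year=2013;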

Note:
• In Hive, partitions are logical; in an RDBMS, partitions are physical.
• We use the technique of partitioning to manage incremental loads.

Managed Tables and External Tables

When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the data into its warehouse directory.
Alternatively, you may create an external table, which tells Hive to refer to the data that is at an existing location outside the warehouse directory.

The difference between the two types of table is seen in the LOAD and DROP semantics.

CREATE TABLE managed_table(dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;

CREATE EXTERNAL TABLE external_table(dummy STRING)
            LOCATION '/user/tom/external_table';

LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
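The DROP semantics differ as well. Using the two tables above, a short sketch of the expected behaviour:

DROP TABLE managed_table;    -- removes the metadata AND the data files in the warehouse
DROP TABLE external_table;   -- removes only the metadata; the files at /user/tom/external_table remain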


Which one to use?
As a rule of thumb, if you are doing all your processing with Hive, then use managed tables, but if you wish to use Hive and other tools on the same dataset, then use external tables. A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table. This works the other way around, too—an external table (not necessarily on HDFS) can be used to export data from Hive for other applications to use.
Another reason for using external tables is when you wish to associate multiple schemas with the same dataset.


Friday, June 7, 2013

Hive Class 2

Hive Data Organized as

• Databases – Namespaces that separate tables
• Tables – Homogeneous units of data with the same schema
• Partitions – A way of dividing a table into coarse-grained parts based on the value of a partition column, such as date
• Buckets – Tables or partitions may be further subdivided into buckets, to give extra structure to the data that may be used for more efficient queries (see the sketch after this list)
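A minimal sketch of a bucketed table (the table and column names are illustrative):

hive> create table sales_bucketed(prid int, prname string, price double) clustered by (prid) into 4 buckets row format delimited fields terminated by ',';

Hive hashes the prid value to place each row in one of the four buckets, which helps with sampling and map-side joins. When populating such a table with INSERT, older Hive versions also need: set hive.enforce.bucketing=true;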

Difference Between HIVE and SQL


HIVE class1

                                          What is Hive?
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:
• Tools to enable easy data extract/transform/load (ETL)
• A mechanism to impose structure on a variety of data formats
• Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
• Query execution via MapReduce
• An SQL-like query language, HiveQL, which allows plugging custom mappers/reducers into queries and also allows UDFs (see the sketch after this list)
• A design that enables easy data summarization, ad-hoc querying, and analysis of large volumes of data
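A minimal sketch of plugging a UDF into a query (the jar path and class name here are hypothetical, not a real library):

hive> add jar /tmp/my-udfs.jar;                                         -- hypothetical jar containing the UDF
hive> create temporary function clean_name as 'com.example.CleanName';  -- hypothetical UDF class
hive> select clean_name(prname) from sales_day;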


                              What Hive is Not?
• Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes). As a result it cannot be compared with systems such as an RDBMS.
• Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs). A table update is achieved by transforming the data into a new table (see the sketch after this list).
• Hive does not use 'schema on write': it does not verify the data when it is loaded. It uses 'schema on read', which verifies the data when a query runs.
• Unlike an RDBMS table, more than one schema can be applied to the same data.
• Hive doesn't define clear semantics for concurrent access to tables, which means applications need to build their own application-level concurrency or locking mechanisms. The Hive team is actively working on improvements in all these areas.
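A minimal sketch of the transform-into-a-new-table update pattern (the new table name and the filter are illustrative):

hive> create table sales_day_clean as
    > select prid, prname, quantity, price, branch
    > from sales_day
    > where price > 0;   -- keep only the valid rows; sales_day itself is untouched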

                                       Hive Is Used For
• Log Processing
• Text Mining
• Document Indexing
• Customer-facing Business Intelligence (e.g., Google Analytics)
• Predictive Modeling
• Hypothesis Testing


Hive Architecture
• Metastore: stores the system catalog
• Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics
• Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks (see the EXPLAIN sketch after this list)
• Execution engine: executes the tasks in proper dependency order; interacts with Hadoop
• Hive Server: provides a Thrift interface and JDBC/ODBC for integrating other applications
• Client components: CLI, web interface, JDBC/ODBC interface
• Extensibility interfaces: include SerDe, User-Defined Functions, and User-Defined Aggregate Functions
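To watch the query compiler at work, EXPLAIN prints the graph of stages for a query; a minimal sketch against the sales_day table from above:

hive> explain select branch, count(*) from sales_day group by branch;

The output lists the stages (typically a map/reduce stage followed by a fetch stage) and their dependency order, which is what the execution engine then runs.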

Ecosystem Definitions

Sqoop: used to import data from an RDBMS and also to export data back into an RDBMS.
Flume: used to import streaming data.
Oozie: used to schedule (Hadoop) jobs and define workflows.

ZooKeeper: state maintenance (controlling data locks). It stores coordination records, tracks failures, and controls the data locks.

MapReduce: MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
The framework is divided into two parts:
Map process:
In this process the input is taken by the master node, which divides it into smaller tasks and distributes them to the worker nodes. The worker nodes process these subtasks and pass the results back to the master node.
Reduce process:
In this process the master node combines all the answers provided by the worker nodes to get the result of the original task. The main advantage of MapReduce is that the map and reduce steps are performed in distributed mode. Since each map operation is independent, all maps can be performed in parallel, reducing the net computing time.
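Because Hive queries compile to MapReduce, a simple aggregation makes the two phases concrete; a hedged sketch using the sales_day table from the partitions section:

hive> select branch, sum(price * quantity)   -- map: each mapper scans its split and emits (branch, amount)
    > from sales_day
    > group by branch;                       -- reduce: each reducer sums the amounts for one branch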








Saturday, February 23, 2013

Hive+Xml Processing


First, understand xpath(): we use it to parse XML data into a string array.
Example: small XML data
   <rec><name>Babu</name><age>25</age><sex>male</sex></rec>
   <rec><name>Radha</name><age>23</age><sex>female</sex></rec>
NOTE: XML data is converted into a Hive table in a two-step process:
1. Convert the XML data into array format.
2. Convert the array data into Hive table format.
Process:
Step 1: create the Hive table
      EX: hive> create table hivexml(str string);
Step 2: load the XML data into the Hive table
     EX: hive> load data local inpath 'xmlfile' into table hivexml;
•  This step loads all the local XML data into your Hive table as-is; we can then convert that data into STRING ARRAY format using xpath(), and then convert the array data into normal Hive table data.
Step 3: convert the XML data into array format
     EX: hive> select xpath(str,'rec/*/text()') from hivexml;
•  Output: ["Babu","25","male"]
           ["Radha","23","female"]

Explanation of 'rec/*/text()':
                     rec: the node name of the XML record (check the XML data above)
                       *: selects all the fields of the XML record
If you want specific fields, simply mention them, like below:
Ex: hive> select xpath(str,'rec/name/text()') from hivexml;

•  Output: ["Babu"]
           ["Radha"]
Step 4: create the Hive table with the required columns
   EX: hive> create table newhivexml(name string, age int, sex string);
•  After creating the table, load the XML array-format data into the newhivexml table as below.
Step 5:
     hive> insert overwrite table newhivexml select xpath_string(str,'rec/name'), xpath_int(str,'rec/age'), xpath_string(str,'rec/sex') from hivexml;

hive> select * from newhivexml;
This returns the data in table format, like below:
name    age    sex
Babu    25      male
Radha   23      female

Thank you.
These notes are only meant to give you a basic idea; please give me your feedback.



Friday, February 8, 2013

Hadoop Ecosystem

 

The above diagram explains the Hadoop ecosystem: a combination of different technologies, each doing a different type of work, as you can clearly see above.
The ecosystem of Hadoop is:

      Name            Purpose

    1. Hive          (Data Warehouse)
    2. Pig           (Text Mining)
    3. HBase         (Random Operations)
    4. Sqoop         (Export and Import)
    5. Flume         (Streaming Data)
    6. Oozie         (Scheduler and Workflow Design)
    7. ZooKeeper     (State Maintenance)


1. Hive :
  •  Hive is a data warehouse in the Hadoop environment.
  •  It processes structured, semi-structured, and unstructured data.
  •  Unstructured data can be processed by first converting it into structured data.


2. Pig :
  • Pig is used for text analytics (mining).
  • Pig is used to process XML and JSON data.
  • Even structured data that is impossible to handle with Hive (HQL) can be processed by Pig.
  • Additional functionality (not built into Pig) can be added using UDFs (User-Defined Functions).
  • Pig UDFs can be written in the following languages, e.g., Java, Ruby, Python, JavaScript, C++, etc.
  • Hive also supports UDFs; Hive UDFs can be written in, e.g., Java, Ruby, Python, C++, R, etc.
  • When you run queries in Pig, the framework automatically builds Java MapReduce code and submits it as a job to the cluster.







Tuesday, February 5, 2013

Hadoop Training

Hi, welcome to Programming Hadoop!