Saturday, February 23, 2013

Hive+Xml Processing

First, understand XPATH(): this function parses XML data into a string array.

Example: small xml data

<rec><name>Radha</name><age>23</age><sex>female</sex></rec>

NOTE: XML data is converted into a Hive table in a two-step process:

1. Convert the XML data into array format.

2. Convert the array data into Hive table format.

Process :

Step1: create the hive table

Ex: Hive>create table hivexml(str string);

Step2: load the xmldata into hive table

EX: Hive>load data local inpath 'xmlfile' into table hivexml;

This step loads all the local XML data into your Hive table as it is (each line of the file becomes one row). We can then convert that data into STRING ARRAY format by using XPATH(), and after that convert the array data into normal Hive table data.

Step3: convert the xml data into array format

EX: Hive>select xpath(str,'rec/*/text()') from hivexml;

Output: ["Babu","25","male"]

["Radha","23","female"]

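Explanation of 'rec/*/text()'

rec: the root node of each XML record, the same as in the XML data (check the sample XML data).

*: matches all the fields (child elements) of the XML record.

text(): returns the text content of each matched field.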

If you want a specific field, simply mention it by name, like below:

Ex: Hive>select xpath(str,'rec/name/text()') from hivexml;

Output: ["Babu"]

["Radha"]
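To see what the two path expressions select, here is a minimal Python sketch that mimics the results outside Hive. It is only an illustration: Hive evaluates XPath itself, while this sketch uses Python's standard xml.etree.ElementTree and plain loops. The function names all_fields and one_field are made up for this example, and the XML for the "Babu" record is assumed from the output shown above.

```python
import xml.etree.ElementTree as ET

# One XML document per row, like the rows of the hivexml table.
# The "Babu" record is an assumption reconstructed from the sample output.
rows = [
    "<rec><name>Babu</name><age>25</age><sex>male</sex></rec>",
    "<rec><name>Radha</name><age>23</age><sex>female</sex></rec>",
]

def all_fields(xml_row):
    """Mimic xpath(str, 'rec/*/text()'): text of every child of <rec>."""
    rec = ET.fromstring(xml_row)
    if rec.tag != "rec":
        return []  # the path starts at 'rec', so other roots match nothing
    return [child.text for child in rec if child.text is not None]

def one_field(xml_row, field):
    """Mimic xpath(str, 'rec/<field>/text()'): text of the named children only."""
    rec = ET.fromstring(xml_row)
    if rec.tag != "rec":
        return []
    return [child.text for child in rec.findall(field) if child.text is not None]

for row in rows:
    print(all_fields(row), one_field(row, "name"))
```

Each row gives one array, which is why the Hive query prints one array per record.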

Step4: create the Hive table with the required columns

EX: Hive> create table newhivexml(name string,age int,sex string);

After creating the table, load the XML data into the newhivexml table like below.

Step5:

Hive> insert overwrite table newhivexml select xpath_string(str,'rec/name'), xpath_string(str,'rec/age'), xpath_string(str,'rec/sex') from hivexml;

Hive>select * from newhivexml;

This returns the data in table format, like below.

name age sex

Babu 25 male

Radha 23 female
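The conversion in Step5 (one XML string per row becomes one typed row) can be sketched in Python as below. This is a rough illustration, not how Hive does it internally. The helper xpath_string_like and the Person type are made-up names, the "Babu" XML is assumed from the output above, and the sketch assumes xpath_string returns the text of the first matching node or an empty string when nothing matches.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

class Person(NamedTuple):
    """Same columns as the newhivexml table: name string, age int, sex string."""
    name: str
    age: int
    sex: str

def xpath_string_like(rec, field):
    """Text of the first <field> child of rec, or '' if there is none."""
    node = rec.find(field)
    if node is None or node.text is None:
        return ""
    return node.text

def to_row(xml_row):
    """Turn one XML record into one typed table row."""
    rec = ET.fromstring(xml_row)
    return Person(
        name=xpath_string_like(rec, "name"),
        # int() raises on bad input; Hive would store NULL instead.
        age=int(xpath_string_like(rec, "age")),
        sex=xpath_string_like(rec, "sex"),
    )

rows = [
    "<rec><name>Babu</name><age>25</age><sex>male</sex></rec>",
    "<rec><name>Radha</name><age>23</age><sex>female</sex></rec>",
]

table = [to_row(r) for r in rows]
for person in table:
    print(person.name, person.age, person.sex)
```

Note that age ends up as a number, not a string, matching the int column of the new table.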

Thank you.

* This note is only meant to give a basic idea. Please give me your feedback.

Friday, February 8, 2013

Hadoop Ecosystem

The above diagram clearly explains the Hadoop ecosystem. It is a combination of different technologies, each doing a different type of work, as shown in the diagram.
The ecosystem of Hadoop is:

Name (Purpose)

1. Hive (Data Warehouse)
2. Pig (Text Mining)
3. HBase (Random Operations)
4. Sqoop (Export and Import)
5. Flume (Streaming Data)
6. Oozie (Scheduler and Workflow Design)
7. ZooKeeper (State Maintenance)

1. Hive :

Hive is a data warehouse for the Hadoop environment.
It is used to process structured, semi-structured and unstructured data.
Unstructured data is processed by first converting it into structured data.

2. Pig :

Pig is used for text analytics (mining).
Pig can process XML and JSON data.
Data that is structured but cannot be handled by Hive (HQL) can still be processed by Pig.
Additional functionality (not built into Pig) can be added by using UDFs (User Defined Functions).
Pig UDFs can be written in the following languages:

i.e.: Java, Ruby, Python, JavaScript, etc.

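Hive also supports UDFs. Native Hive UDFs are written in Java; code in other languages can be plugged in as external scripts through Hive's TRANSFORM (streaming) clause,

i.e.: Ruby, Python, C++, R, etc.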
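As an idea of how Python can be used with Hive, here is a minimal sketch of a script written for Hive's TRANSFORM clause, which sends rows to an external script as tab-separated lines and reads tab-separated lines back. The column layout (name, age, sex) follows the newhivexml table from the XML post, and the "adult/minor" rule is a made-up example, not something from Hive. For easy testing the sketch reads from an in-memory sample instead of standard input.

```python
import io

def transform(lines):
    """Yield one tab-separated output line per tab-separated input line."""
    for line in lines:
        name, age, sex = line.rstrip("\n").split("\t")
        # Hypothetical rule, for illustration only: add an age group column.
        group = "adult" if int(age) >= 18 else "minor"
        yield "\t".join([name, age, sex, group])

# In a real streaming script the input would be sys.stdin;
# an in-memory sample stands in for it here.
sample = io.StringIO("Babu\t25\tmale\nRadha\t23\tfemale\n")
for out_line in transform(sample):
    print(out_line)
```

In Hive, a script like this would typically be registered with ADD FILE and called with SELECT TRANSFORM(name, age, sex) USING 'python <script name>' AS (name, age, sex, age_group) FROM newhivexml; the script name and output column names here are placeholders.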

When you run a high-level query in Pig, the framework automatically builds the Java MapReduce code and submits it to run on the JVM.

Tuesday, February 5, 2013

Hadoop Training

Hi, welcome to Programming Hadoop

HADOOP - the full story

Hi,

Here I want to explain briefly about "Hadoop".

Hadoop is an open-source product developed by Apache to handle Big Data.

Generally, data warehouses and BI (Business Intelligence) systems are able to work on structured data, but not on unstructured data.

Examples of unstructured data are large text files, documents, log files (database logs or weblogs) and web crawling data. For these, the storage definitely has to be in file systems.

On top of that, BI tools and DW tools cannot process huge volumes of data, such as petabytes, zettabytes or yottabytes, whereas Hadoop is able to process both structured and unstructured data at these sizes.

To store this data, Hadoop provides a powerful file system called HDFS.

To process this data faster, Hadoop provides an ecosystem of tools, as follows:

MapReduce,
Pig,
Hive,
HBase,
Flume,
Sqoop etc.

Other NoSQL databases are also used, such as

MongoDB,
DynamoDB,
CouchDB etc.

For more details, please send me a reply.

M.Kaali Babu
hadooptoall@gmail.com