Monday, September 9, 2013

HADOOP IMPORTANT LINKS

http://www.techspritz.com/hadoop-single-node-cluster-setup/

http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/

http://hortonworks.com/blog/hadoop-hadoop-hurrah-hdp-for-windows-is-now-ga/

http://bigdatastudio.com/2013/05/19/big-data-jobs/

http://hadoopblog.blogspot.in/2010/05/facebook-has-worlds-largest-hadoop.html?goback=.gde_4244719_member_243018706

http://www.youtube.com/watch?v=A02SRdyoshM

http://jugnu-life.blogspot.com/2012/03/installing-pig-apache-hadoop-pig.html

http://www.aptibook.com/Technical/Hadoop-interview-questions-and-answers?id=2

http://www.pappupass.com/class/index.php/hadoop/hadoop-interview-questions

http://www.rohitmenon.com/index.php/cloudera-certified-hadoop-developer-ccd-410/

http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html

Partitions

Partition: a way to categorize the data in a table.
- When we request a piece of data, Hive can use the partitions to read only the matching part of the table.
- By default, a table is non-partitioned.
Types:
  1. Partitioned
  2. Non-partitioned (the default)
EX: Non-partitioned:
Syntax: create table <table name>(col1 data type, col2 data type, ...) row format delimited fields terminated by ',';
Loading: load data local inpath '<local file name>' into table <table name>;
EX: Partitioned:


Syntax EX:
hive> create table sales_day(prid int, prname string, quantity int, price double, branch string) partitioned by (day int, month int, year int) row format delimited fields terminated by ',';


hive> load data local inpath 'sales' into table sales_day partition(day=12,month=2,year=2013);

hive> load data local inpath 'sales2' into table sales_day partition(day=13,month=2,year=2013);

hive> select * from sales_day;
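
On disk, Hive keeps each partition as a nested subdirectory of the table's directory, one level per partition column, named column=value. The sketch below uses Python purely for illustration and builds the paths that the two loads above would produce. The warehouse root /user/hive/warehouse is the usual default and is an assumption here, and partition_path is a hypothetical helper, not a Hive API.

```python
import posixpath

# Assumed default value of hive.metastore.warehouse.dir; your cluster may differ.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def partition_path(table, partition_spec, root=WAREHOUSE_ROOT):
    """Hypothetical helper: directory that holds one partition of a table.

    partition_spec is a list of (column, value) pairs, in the order the
    columns were declared in PARTITIONED BY.
    """
    levels = ["{}={}".format(col, val) for col, val in partition_spec]
    return posixpath.join(root, table, *levels)


first = partition_path("sales_day", [("day", 12), ("month", 2), ("year", 2013)])
second = partition_path("sales_day", [("day", 13), ("month", 2), ("year", 2013)])
print(first)   # /user/hive/warehouse/sales_day/day=12/month=2/year=2013
print(second)  # /user/hive/warehouse/sales_day/day=13/month=2/year=2013
```

Because day was declared first, it becomes the outermost directory level. Declaring the columns as (year, month, day) instead would group the directories by year first.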

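Note:
- In Hive, partition columns are logical: their values are not stored in the data files but come from the partition's directory name (for example day=12). In an RDBMS, partitions are physical divisions of the table's storage.

- We use partitions to manage incremental loads: each new batch goes into its own partition without rewriting the existing ones.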
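
To see why partitions suit incremental loads, here is a toy model in Python (not Hive code). Each partition is a separate bucket, so a new day's load adds a bucket without touching earlier ones, and a query filtered on the partition columns reads only the matching bucket. The class PartitionedTable and the sample rows are hypothetical.

```python
class PartitionedTable:
    """Toy stand-in for a Hive table partitioned by (day, month, year)."""

    def __init__(self):
        self.partitions = {}  # (day, month, year) -> list of row tuples

    def load(self, rows, day, month, year):
        # Like LOAD DATA ... PARTITION(...): only one partition is written.
        self.partitions.setdefault((day, month, year), []).extend(rows)

    def select(self, day=None, month=None, year=None):
        # Skip every partition whose key does not match the filter.
        scanned = 0
        result = []
        for (d, m, y), rows in self.partitions.items():
            if day not in (None, d) or month not in (None, m) or year not in (None, y):
                continue
            scanned += 1
            # Partition values show up as extra trailing columns, as in SELECT *.
            result.extend(row + (d, m, y) for row in rows)
        return result, scanned


sales = PartitionedTable()
# Hypothetical rows: (prid, prname, quantity, price, branch)
sales.load([(1, "pen", 10, 5.0, "hyd")], day=12, month=2, year=2013)
sales.load([(2, "book", 3, 120.0, "blr")], day=13, month=2, year=2013)

rows, scanned = sales.select(day=13, month=2, year=2013)
print(rows)     # [(2, 'book', 3, 120.0, 'blr', 13, 2, 2013)]
print(scanned)  # 1
```

The second load leaves the day=12 bucket untouched, which is the property that makes daily incremental loads cheap.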

Managed Tables and External Tables

When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the data into its warehouse directory.
Alternatively, you may create an external table, which tells Hive to refer to the data that is at an existing location outside the warehouse directory.

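The difference between the two types of table is seen in the LOAD and DROP semantics. For a managed table, LOAD moves the file into Hive's warehouse directory and DROP deletes the data along with the metadata. For an external table, the data stays outside the warehouse directory, at the table's LOCATION, and DROP deletes only the metadata.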

CREATE TABLE managed_table(dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;

CREATE EXTERNAL TABLE external_table(dummy STRING)
  LOCATION '/user/tom/external_table';

LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
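
The DROP behaviour can be tried out safely with a toy model on the local filesystem. The Python sketch below is not Hive code: ToyMetastore, the temporary directories, and the file names data1.txt and data2.txt are all hypothetical. As a simplification, it creates the external location itself so the example is self-contained.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Toy model of Hive's LOAD and DROP behaviour; not a Hive API."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = warehouse_dir
        self.tables = {}  # table name -> (location, is_external)

    def create_table(self, name, location=None, external=False):
        if location is None:
            location = os.path.join(self.warehouse_dir, name)
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (location, external)

    def load(self, src_path, name):
        # LOAD DATA INPATH moves the file into the table's directory.
        location, _ = self.tables[name]
        shutil.move(src_path, os.path.join(location, os.path.basename(src_path)))

    def drop_table(self, name):
        location, external = self.tables.pop(name)  # metadata always goes
        if not external:
            shutil.rmtree(location)  # managed table: the data goes too


root = tempfile.mkdtemp()
store = ToyMetastore(os.path.join(root, "warehouse"))


def make_file(name):
    path = os.path.join(root, name)
    with open(path, "w") as f:
        f.write("dummy\n")
    return path


store.create_table("managed_table")
store.load(make_file("data1.txt"), "managed_table")
managed_file = os.path.join(root, "warehouse", "managed_table", "data1.txt")

ext_dir = os.path.join(root, "external_table")
store.create_table("external_table", location=ext_dir, external=True)
store.load(make_file("data2.txt"), "external_table")
external_file = os.path.join(ext_dir, "data2.txt")

store.drop_table("managed_table")
store.drop_table("external_table")

managed_survives = os.path.exists(managed_file)
external_survives = os.path.exists(external_file)
print(managed_survives)   # False: dropped together with the table
print(external_survives)  # True: only the metadata was removed

shutil.rmtree(root)  # clean up the temporary directory
```

After both drops the metastore has no tables left, but the external table's file is still in place for other tools to read.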


Which one to use?
As a rule of thumb, if you are doing all your processing with Hive, then use managed tables, but if you wish to use Hive and other tools on the same dataset, then use external tables. A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table. This works the other way around, too—an external table (not necessarily on HDFS) can be used to export data from Hive for other applications to use.
Another reason for using external tables is when you wish to associate multiple schemas with the same dataset.