Setting Up a Multi-Node Hadoop Cluster at Home

  1. Get VirtualBox software to create multiple virtual machines for experimentation

  2. Download CentOS installer DVD image

  3. Install CentOS as a new RedHat 64-bit virtual machine in VirtualBox

After the installation, run 'yum update' to synchronize the OS with the latest release

  1. Install Guest Additions (needed for mounting host system folders, among other things)

First, install the following packages in the guest operating system:

yum install perl dkms binutils gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel

Then mount the Guest Additions ISO from the VirtualBox menu (note that this does not create a mount point)

Create a mount point for the Guest Additions and mount the device:

mkdir /media/cdrom; mount /dev/sr0 /media/cdrom

Then install the Guest Additions:

cd /media/cdrom; sh ./VBoxLinuxAdditions.run

After this, if you intend to mount a shared folder from the host OS, configure it from the VirtualBox menu and reboot the guest OS. Later, confirm it is mounted by running the following command.

mount | grep vbox

  1. Understand the networking requirements of VirtualBox: DHCP, static IPs tied to each VM's MAC address, etc.

  2. Install Guest Additions for the OS you are using in the VM
  3. Update the installed VM using "yum update"

  4. Clone VM to create multiple nodes as per your choice

  5. Please note that after cloning, a VM may fail to obtain an IP address from the DHCP server. To fix this, delete the /etc/udev/rules.d/70-persistent-net.rules file and edit /etc/sysconfig/network-scripts/ifcfg-eth0 as shown in the next step.
  6. Make ifcfg-eth0 look like the following (using the clone's own MAC address):
     DEVICE="eth0"
     HWADDR=<MAC address of the cloned VM>
  7. Ensure all VMs get static IP

  8. Add IP entries for all VMs in the parent computer's /etc/hosts

  9. Generate SSH keys on the parent computer and on all VMs (to set up passwordless SSH)

  10. Copy the SSH public key of the parent computer to all VMs

  11. Set up the time daemon (ntpd) on all of the virtual nodes. This is a required step: the cluster will run into issues later if the time is not in sync across servers.  chkconfig ntpd on; service ntpd start
  12. Copy the public key of the Cloudera Manager node (VM1) to all other VMs

  13. You need DNS configured for hostname resolution; otherwise, edit the /etc/hosts file on each VM to include its own and every other participating VM's hostname and IP address.
  14. Get the Cloudera yum repo

  15. Get the Cloudera Manager yum repo

  16. Install MySQL on VM1 (if not already done)

  17. Install the Java Development Kit and runtime (latest version at the time: 1.6.0_37)
  18. Install Cloudera Manager
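Steps 9–12 above amount to standard passwordless-SSH setup. A minimal sketch follows; the hostnames vm1..vm3 and the root login are placeholders for illustration, not values from this post:

```shell
# Generate a key pair once on the parent machine (no passphrase, lab use only).
# A temp dir is used here for illustration; in practice use ~/.ssh.
KEYDIR="$(mktemp -d)"
ssh-keygen -q -t rsa -N "" -f "$KEYDIR/id_rsa"

# Then push the public key to every VM (hypothetical hostnames):
#   for host in vm1 vm2 vm3; do ssh-copy-id -i "$KEYDIR/id_rsa.pub" root@$host; done
ls "$KEYDIR"
```

After this, `ssh root@vm1` from the parent should log in without a password prompt.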

Installing Java

Install using the *.rpm.bin installers (both JRE and JDK)

Set following in your profile/.bashrc (for all nodes)

#Hadoop needed settings
export JAVA_HOME="/usr/java/jre1.6.0_37"
export PATH=$JAVA_HOME/bin:$PATH
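A quick way to confirm the two exports took effect in the current shell (the JRE path is the one from the profile snippet above; adjust to your install):

```shell
# Same exports as the .bashrc snippet, then a sanity check that
# $JAVA_HOME/bin is now the first entry on PATH.
export JAVA_HOME="/usr/java/jre1.6.0_37"
export PATH="$JAVA_HOME/bin:$PATH"

first_entry="$(echo "$PATH" | cut -d: -f1)"
[ "$first_entry" = "$JAVA_HOME/bin" ] && echo "PATH OK"
```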

Installing MySQL

sudo yum install mysql-server

The MySQL JDBC connector is required if Cloudera Manager is to be configured to use MySQL:

yum install mysql-connector-java

Automated installation of Hadoop

Cloudera Manager install via yum repository

yum install cloudera-manager-server cloudera-manager-daemons cloudera-manager-agent

  • Discovers the cluster hosts you specify via IP address ranges or hostnames, using SSH
  • Configures the package repositories for Cloudera Manager, CDH3, and the Oracle JDK
  • Installs the Cloudera Manager Agent and CDH3 (including Hue) on the cluster hosts
  • Installs the Oracle JDK if it is not already installed on the cluster hosts
  • Determines the mapping of services to hosts
  • Suggests a Hadoop configuration and starts the Hadoop services

The cloudera-manager-daemons package is required by cloudera-manager-server.

Deployment Architecture

CentOS-VM1 – Cloudera Manager

CentOS-VM2…n – Hadoop nodes
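If DNS is not available, each node's /etc/hosts can spell this layout out explicitly. The 192.168.56.x addresses below are placeholders (VirtualBox's default host-only subnet), not values from this post:

```
192.168.56.101   centos-vm1    # Cloudera Manager
192.168.56.102   centos-vm2    # Hadoop node
192.168.56.103   centos-vm3    # Hadoop node
```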

Hadoop installation using Puppet

What it takes to start an automated installation of Hadoop on a minimal CentOS installation:

1. The SCP command

Start a VM in headless mode:

VBoxHeadless -startvm "CentOS-VM1"

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar -file /tmp/text_bak/ -mapper /tmp/text_bak/ -file /tmp/text_bak/ -reducer /tmp/text_bak/ -input /tmp/text/* -output /tmp/
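Hadoop streaming runs any executables that read stdin and write stdout as the mapper and reducer, so the wiring can be rehearsed locally with a plain pipe before submitting to the cluster. Below is a hypothetical word-count pair (not the scripts from this post), with `sort` standing in for the shuffle phase:

```shell
# mapper: emit one word per line; reducer: count occurrences per word.
mapper()  { tr -s ' ' '\n'; }
reducer() { uniq -c | awk '{print $2 "\t" $1}'; }

printf 'big data big cluster\n' | mapper | sort | reducer
# → big      2
#   cluster  1
#   data     1
```

On the cluster, the same two scripts would be shipped with -file and named by -mapper and -reducer.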


Posted by on December 22, 2012 in Uncategorized


Clean Uninstall of Hortonworks HDP 2.2

I love Hadoop, and Hortonworks is one of my favorite Hadoop distributions. While experimenting with Hadoop installations, however, I have had many instances where I needed to start afresh on a set of physical as well as virtual Hadoop cluster nodes. Hortonworks provides great documentation, but as of today I found it incomplete when it comes to uninstalling their distribution. This post aims to be a small guide to everything one might need to uninstall or clean to bring the cluster back to the state it was in before Hadoop was first installed. I have tried these steps on my Linux cluster running HDP 2.2 (which was earlier upgraded from HDP 2.1).

Software/Hardware Specifications:

Operating System: RedHat 2.6.32-504.1.3.el6

Apache Ambari Version: 1.7 (Hortonworks Weblink)

Hadoop Distribution: HDP 2.2

Tools used: Yum, pdsh, Various Linux Commands

Tip:- Use pdsh to avoid running commands on individual servers and, at the same time, to keep a nice history of the process.

STEP I – Stop all services

For this, navigate to your Ambari UI at "http://<Ambari-Server FQDN>:<Ambari-Server Port>". Use the "Stop All Services" option. Ensure that all of the service indicator icons turn RED, which means the services are no longer running. If there are issues while stopping services, refer to the error logs in the Ambari UI to find and fix the problem.

STEP II – Stop Ambari Server & Agents

$ ssh root@<Ambari-Server>

$ ambari-server stop

$ pdsh -a ambari-agent stop | dshbak

Note:- Verify from the output that ambari-server and ambari-agent have been stopped on all of the servers.

STEP III – Remove Hadoop packages

$ pdsh -a yum -y remove `yum list installed | grep -i hadoop | cut -d. -f1 | sed -e :a -e '$!N; s/\n/ /; ta'` | dshbak

$ pdsh -a yum -y remove ambari* | dshbak

$ pdsh -a yum -y remove `yum list installed | grep -w 'HDP' | egrep -v -w 'pdsh|dshbak' | cut -d. -f1 | grep -v "^[ ]" | sed -e :a -e '$!N; s/\n/ /; ta'`| dshbak

$ pdsh -a yum -y remove `yum list installed | egrep -w 'hcatalog|hive|hbase|zookeeper|oozie|pig|snappy|hadoop-lzo|knox|hadoop|hue' | cut -d. -f1 | grep -v "^[ ]" | sed -e :a -e '$!N; s/\n/ /; ta'`|dshbak
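Each of the removals above relies on the same sed idiom, `sed -e :a -e '$!N; s/\n/ /; ta'`, to fold yum's one-package-per-line output into the single space-separated argument list that `yum remove` expects. It can be sanity-checked locally (the package names here are just sample input):

```shell
# Fold newline-separated items into one space-separated line:
# :a sets a label, $!N appends the next line unless at end of input,
# s/\n/ / replaces the embedded newline, ta loops back while substitutions occur.
printf 'hadoop-hdfs\nhadoop-yarn\nzookeeper\n' \
  | sed -e :a -e '$!N; s/\n/ /; ta'
# → hadoop-hdfs hadoop-yarn zookeeper
```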

STEP IV – Uninstall databases used by HDP

Note:- You may skip this step if you want to retain the databases for any specific reason. I demonstrate uninstalling MySQL and PostgreSQL; in case you use any other database (such as Oracle), please refer to that database's manual for uninstallation.

#I had multiple MySQL servers in my Hadoop cluster, so I used pdsh to remove all server and client components at once

$ pdsh -a yum -y remove mysql mysql-server | dshbak

#I did not want to back up the old MySQL data, hence deleting all of it. You might want to save a copy.

$ pdsh -a rm -rf /var/lib/mysql

$ pdsh -a yum -y remove postgre* | dshbak

STEP V – Remove all Hadoop related folders/logs/etc

$ pdsh -a rm -r `find /etc -maxdepth 1 | egrep -wi 'mysql|hcatalog|hive|hbase|zookeeper|oozie|pig|snappy|hadoop|knox|hadoop|hue|ambari|tez|flume|storm|accumulo|spark|kafka|falcon|slider|ganglia|nagios|phoenix' | sed -e :a -e '$!N; s/\n/ /; ta'`

$ pdsh -a rm -r `find /var/log -maxdepth 1 | egrep -wi 'mysql|hcatalog|hive|hbase|zookeeper|oozie|pig|snappy|hadoop|knox|hadoop|hue|ambari|tez|flume|storm|accumulo|spark|kafka|falcon|slider|ganglia|nagios|phoenix' | sed -e :a -e '$!N; s/\n/ /; ta'`

$ pdsh -a rm -r `find /tmp -maxdepth 1 | egrep -wi 'hadoop' | sed -e :a -e '$!N; s/\n/ /; ta'`

#You will have defined a Hadoop data/name node folder or partition. Please ensure you delete it from all of the nodes. In my case it was /hadoop

$ pdsh -a rm -r /hadoop

STEP VI – Reboot all servers

This step ensures that any runaway processes are cleaned up and the system returns to a sane state. Repeat the step for each server in your cluster. If the server from which you run the following command is itself part of the Hadoop cluster, reboot it after all the others are rebooted.

$ ssh root@<Cluster-Node-FQDN> shutdown -r now

#Wait for some time before checking whether all the servers have rebooted

$ pdsh -a uptime | dshbak

STEP VII – Update software packages on each node to the latest.

This step ensures that your OS packages are the latest from the repository, so there are no surprises during the next round of installation. Repeat the following command for each node in the Hadoop cluster.

$ ssh root@<Cluster-Node-FQDN> yum -y update

I hope I covered all the steps required to uninstall Hortonworks Hadoop HDP 2.2. If you find any discrepancies or issues after following these steps, please let me know and I will gladly update this post.

The very next step you are likely looking at is installing Hadoop, and I strongly believe the Hortonworks team has done a great job documenting the detailed process at,

Installing Hortonworks Hadoop HDP 2.2

Enjoy Hadooping …


Posted by on February 15, 2015 in Uncategorized



Hadoop Mindmap

I have been going through the list of tools and frameworks that work with Hadoop to solve various use cases. I found that the list is really long, and it is growing at a rapid rate, as those who set out to design a framework or tool on Hadoop often find that no existing one fits their requirements. At the same time, an architect like me, who wants to integrate enterprise data sets onto the Hadoop platform and derive analytical results, finds it difficult to get all the available tools in a single view, categorized in a way that makes decision-making easy.

I hope my initiative to build a mind map will create a ready-to-use reference to the tools and frameworks for architects designing and building Big Data solutions for various enterprises. Please refer to the following quick reference card, and feel free to save it for offline reference.

P.S. I intend to share a version of the map with web links for each tool and framework shortly after this post.

Hadoop_mindmap


Posted by on January 28, 2015 in Uncategorized



Oracle like “Dummy” table for Hive

Seasoned Oracle users might miss the "DUAL" table in Hive. Even to run a specific built-in or user-defined function, you might wonder whether there could be a "dummy" table. Here is how you create one.

1) Create a text file representing contents of the dummy table

$ echo 'X' > /tmp/dual.txt

2) Create a hive table

hive> create table dual(dummy string);
#It is always good to check the details
hive> describe extended dual;
#Load the data into the table
hive> load data local inpath '/tmp/dual.txt' overwrite into table dual;

3) Demonstrate the use of dual table

#List available functions in Hive
hive> show functions;
#Demonstrate date_add
hive> SELECT date_add('2010-12-31', 1) from dual;
hive> SELECT upper('demo string') from dual;

Note:- Please be aware that Hive will launch a MapReduce job for each of the above SQL statements, which might annoy you. But when these functions are used with large tables, the overhead is negligible, as the final set of MapReduce tasks is optimized by the framework.


Posted by on April 22, 2014 in Uncategorized


My Hadoop Commands & Recipes Collection

This is a collection of Hadoop commands that either were not known to me earlier or that I want to keep handy as I continue my journey as a Big Data solution architect.

1) How to view the zipped file stored in Hadoop

$ hadoop fs -cat <file Location in HDFS> | <zip cat program> | less
   Zip cat programs: zcat, bzcat, gzcat
   Use less or more as per your choice and need
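The same pipe pattern can be verified without a cluster by substituting a local gzip stream for the `hadoop fs -cat` output (a local stand-in, not an actual HDFS read):

```shell
# Local stand-in for: hadoop fs -cat <file.gz> | zcat | less
# gzip -c compresses to stdout; zcat restores the original text.
echo 'hello from hdfs' | gzip -c | zcat
# → hello from hdfs
```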

2) File system operations through the Hive CLI. This comes in handy when analyzing tables in Hive while needing to reference either HDFS or the local file system

#List all files in HDFS
hive>dfs -ls <file/dir in HDFS> ;
#List all files in local file system
hive>!ls <local file/dir path> ;

3) Hive Vs Beeline CLI & some useful related commands/tips

Purpose: This is the way to connect to Hive and run all Hive operations from a remote server, which may or may not be part of the Hadoop cluster

=> First, download the Hadoop tarball (the complete distribution, as tar-gzipped files)

=> Second, unzip and untar all the packages (hadoop, zookeeper, pig, sqoop, flume, hbase, hive, mahout, oozie) and set HADOOP_HOME to the folder where the Hadoop base package is extracted.

=> The beeline binary is available in the hive/bin folder

#Connect to hive2 server
$ beeline
beeline> !connect jdbc:hive2://<Hive2 Server FQDN>:10000 scott tiger org.apache.hive.jdbc.HiveDriver
#In above statement, provide fully qualified domain name/IP address of the host where Hive2 server is running
0: jdbc:hive2://<Hive2 Server FQDN>:10000> show databases;
#Like above Hive QL, all the other valid operations can now be performed
#You can connect to multiple Hive servers using the same beeline interface and execute all the commands on all connections at once.
#Once done, type '!quit' to quit from the prompt
#same command can be run non-interactively as,
$./beeline -u "jdbc:hive2://<Hive2 Server FQDN>:10000" -n scott -p tiger -d org.apache.hive.jdbc.HiveDriver -e "show databases"

For more information on Beeline, please refer to following URLs,

Apache Wiki on Beeline

Migrating from Hive CLI to beeline: A Primer (by Cloudera)


4) Oracle like “dummy” table in Hive

Please refer following URL of my post on this topic:


5) Creating and using UDF in Hive

I will simply refer to Matthew's blog, as I find it has all the finer details.


6) How to create Hive ORC file explicitly <TBD>

Many more to come as I learn…


Hadoop on Virtualization – Benchmarks

This article describes my personal experience benchmarking a Hadoop cluster running on virtual nodes. The purpose of the post is to share the findings and invite suggestions and questions from the developer community working with Hadoop technology.

Host Machine Configuration

CPU/Processing Power: Intel i5 3570K

Memory: 16GB

Common Guest Machine Configuration

CPU: 1

Chipset: PIIX3

Memory: 2GB

Disk: 15GB

OS: CentOS 6

Hadoop Distribution: HortonWorks HDP 2.0

Hadoop Cluster Details

4 Data Nodes, 1 Name Node, 1 Secondary Name Node

Installed Hadoop modules:

Service           | Version | Description
HDFS              |         | Apache Hadoop Distributed File System
YARN + MapReduce2 |         | Apache Hadoop NextGen MapReduce (YARN)
Nagios            | 3.5.0   | Nagios Monitoring and Alerting system
Ganglia           | 3.5.0   | Ganglia Metrics Collection system
Hive              |         | Data warehouse system for ad-hoc queries & analysis of large datasets, and table & storage management service
HBase             |         | Non-relational distributed database and centralized service for configuration management & synchronization
Pig               |         | Scripting platform for analyzing large datasets
Sqoop             |         | Tool for transferring bulk data between Apache Hadoop and structured data stores such as relational databases
Oozie             |         | System for workflow coordination and execution of Apache Hadoop jobs. This also includes the optional Oozie Web Console, which relies on and will install the ExtJS library.
ZooKeeper         |         | Centralized service which provides highly reliable distributed coordination

Performance Benchmarks

TestDFSIO Suite

Goal: to stress-test HDFS I/O

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient- TestDFSIO -write -nrFiles 10 -fileSize 100

INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
INFO fs.TestDFSIO:            Date & time: Thu Feb 06 16:46:15 IST 2014
INFO fs.TestDFSIO:        Number of files: 10
INFO fs.TestDFSIO: Total MBytes processed: 1000.0
INFO fs.TestDFSIO:      Throughput mb/sec: 13.642750924296376
INFO fs.TestDFSIO: Average IO rate mb/sec: 13.67164421081543
INFO fs.TestDFSIO:  IO rate std deviation: 0.6178493623818809
INFO fs.TestDFSIO:     Test exec time sec: 255.257

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient- TestDFSIO -read -nrFiles 10 -fileSize 100

INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
INFO fs.TestDFSIO:            Date & time: Thu Feb 06 16:49:38 IST 2014
INFO fs.TestDFSIO:        Number of files: 10
INFO fs.TestDFSIO: Total MBytes processed: 1000.0
INFO fs.TestDFSIO:      Throughput mb/sec: 220.799293442261
INFO fs.TestDFSIO: Average IO rate mb/sec: 307.33868408203125
INFO fs.TestDFSIO:  IO rate std deviation: 121.02754910236509
INFO fs.TestDFSIO:     Test exec time sec: 104.963

TeraSort Suite (Work Underway, incomplete)

Goal: to sort the data as fast as possible

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples- teragen 10000000 /Data/terasort-input

Bytes Written=1000000000

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples- terasort /Data/terasort-input /Output/terasort-output

Average Map Time 27sec

Average Reduce Time 48sec

Average Shuffle Time 45sec

Average Merge Time 0sec

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples- teravalidate /Output/terasort-output /Output/terasort-validate


Posted by on February 7, 2014 in Uncategorized






Posted by on May 25, 2013 in knowledge, technology




My NoSQL Understanding

My Understanding of NoSQL World

Cassandra – a powerful NoSQL database


Posted by on May 11, 2013 in knowledge, technology