One of the first things I had to do when I started working with Hadoop was fix bugs within the Hadoop stack. Working on Hadoop internals requires numerous programming tools and libraries.
If you want or need to work on Hadoop code, this article summarizes the packages you need to install and configure to create a Hadoop build and development environment.
Base Operating System
By far the easiest operating system to set up for Hadoop development is a Red Hat-derived distro. I highly recommend CentOS 6.x; I use 64-bit CentOS 6.3. To limit the scope of this article, I'm going to assume you already have a 64-bit CentOS system to work with, so I won't describe the installation procedure here.
Install Oracle JDK 1.6
CentOS normally comes with the OpenJDK Java environment. This is not the version of Java you want to use for Hadoop development. Instead you should install Oracle’s official Java 1.6 JDK and remove OpenJDK. Note you have to run yum as root to be able to install packages on your system.
- Remove OpenJDK.
yum -y remove java-1.6.0-openjdk.x86_64
- Get Oracle’s Java 1.6 JDK. I suggest downloading the rpm.bin version.
- Install JDK 1.6 by double-clicking on the rpm.bin package.
Install CentOS Packages
Install the following CentOS packages using the yum commands as shown. Note some of the packages may already be installed.
yum -y install gcc-c++.x86_64
yum -y install make.x86_64
yum -y install openssl.x86_64 openssl-devel.x86_64 openssh.x86_64
yum -y install libtool.x86_64
yum -y install autoconf.noarch automake.noarch
yum -y install cmake.x86_64
yum -y install xz.x86_64 xz-devel.x86_64
yum -y install zlib.x86_64 zlib-devel.x86_64
yum -y install git.x86_64
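Once the packages are installed, a quick check that the main build commands are on your PATH can save a failed build later. The following is a minimal sketch, not part of the original procedure: `missing_tools` is a hypothetical helper name, and the command names passed to it are my assumption of what each package above provides.

```shell
#!/bin/sh
# Sketch: print the name of every command that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names assumed to come from the packages installed above.
missing=$(missing_tools g++ make openssl libtool autoconf automake cmake xz git)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All build tools found."
fi
```

This only checks for executables; it does not confirm that the -devel header packages are present.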
Install Snappy Libraries
You need to get the snappy libraries from the RPMforge repository. Here is what you do to get the RPMforge repo file and snappy library:
- Download the RPMforge repo file from http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm.
- Install the repo file by typing:
rpm -Uvh rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
- Use yum to get the snappy lib:
yum -y install snappy.x86_64 snappy-devel.x86_64
Install Protobuf
Protocol Buffers are used internally by Hadoop for RPC. Install this facility as follows:
- Download protobuf-2.4.1.tar.gz.
- Unpack and build:
tar zxvf protobuf-2.4.1.tar.gz
cd protobuf-2.4.1
./configure
make
sudo make install
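After the install, it may be worth confirming that the protoc on your PATH is the one you just built. `protoc --version` prints a line of the form "libprotoc 2.4.1", so a small comparison is enough. This is a sketch under that assumption; `protoc_version_matches` is a hypothetical helper name of my own.

```shell
#!/bin/sh
# Sketch: compare the output of `protoc --version` with an expected version.
# protoc_version_matches is a hypothetical helper.
protoc_version_matches() {
    # $1 = output of `protoc --version`, $2 = expected version number
    [ "$1" = "libprotoc $2" ]
}

# Only run the check if protoc is actually installed.
if command -v protoc >/dev/null 2>&1; then
    if protoc_version_matches "$(protoc --version)" 2.4.1; then
        echo "protoc 2.4.1 found"
    else
        echo "unexpected protoc version: $(protoc --version)" >&2
    fi
fi
```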
Install Apache and Findbugs Tools
Last but by no means least, you'll need FindBugs and the Apache development tools: Maven, Ant, and Ivy. The CentOS packages of the Apache tools are usually not what you want. That may change in the future, but in the meantime follow the instructions here to obtain and install the latest tools.
- Download Apache Maven.
- Download Apache Ant.
- Download Apache Ivy.
- Download Findbugs.
- Install each package as follows:
tar zxvf <maven package>.tgz
tar zxvf <ant package>.tgz
tar zxvf <ivy package>.tgz
tar zxvf <findbugs package>.tgz
sudo cp -R <maven directory> /usr/local/apache-maven
sudo cp -R <ant directory> /usr/local/apache-ant
sudo cp -R <ivy directory> /usr/local/apache-ivy
sudo cp -R <findbugs directory> /usr/local/findbugs
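If you would rather script the unpack-and-copy steps, something along these lines could work. It is only a sketch: `install_tool` is a hypothetical helper, and it assumes each archive contains exactly one top-level directory. It copies the contents of that directory straight into the destination, so a path such as /usr/local/apache-maven ends up with bin directly beneath it.

```shell
#!/bin/sh
# Sketch: unpack a .tgz/.tar.gz archive and copy the contents of its
# top-level directory into a destination directory.
# install_tool is a hypothetical helper; it assumes the archive holds
# exactly one top-level directory.
install_tool() {
    archive=$1
    dest=$2
    workdir=$(mktemp -d) || return 1
    if ! tar zxf "$archive" -C "$workdir"; then
        rm -rf "$workdir"
        return 1
    fi
    # Name of the (assumed single) top-level directory in the archive.
    topdir=$(ls "$workdir" | head -n 1)
    mkdir -p "$dest" && cp -R "$workdir/$topdir/." "$dest/"
    status=$?
    rm -rf "$workdir"
    return $status
}

# Example (run as root, with the real archive name substituted):
#   install_tool <maven package>.tgz /usr/local/apache-maven
```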
- Set your .bash_profile or .bashrc to include these environment variables:
export FB_HOME=/usr/local/findbugs
export ANT_HOME=/usr/local/apache-ant
export IVY_HOME=/usr/local/apache-ivy
export M2_HOME=/usr/local/apache-maven
export JAVA_HOME=/usr/java/default
PATH=$PATH:$M2_HOME/bin:$IVY_HOME/bin:$ANT_HOME/bin:$FB_HOME/bin
export PATH
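A mistyped directory in one of these variables tends to surface only later as a confusing build error, so after opening a new shell you might verify that each one points at a real directory. The sketch below assumes the variable names used above; `check_home_dirs` is a hypothetical helper of my own.

```shell
#!/bin/sh
# Sketch: report NAME=VALUE pairs whose value is not an existing directory.
# Prints one line per problem and returns non-zero if any were found.
# check_home_dirs is a hypothetical helper.
check_home_dirs() {
    status=0
    for pair in "$@"; do
        name=${pair%%=*}   # text before the first '='
        dir=${pair#*=}     # text after the first '='
        if [ ! -d "$dir" ]; then
            echo "$name does not point to a directory: '$dir'"
            status=1
        fi
    done
    return $status
}

# ${VAR:-} expands to an empty string when the variable is unset.
check_home_dirs "FB_HOME=${FB_HOME:-}" "ANT_HOME=${ANT_HOME:-}" \
    "IVY_HOME=${IVY_HOME:-}" "M2_HOME=${M2_HOME:-}" \
    "JAVA_HOME=${JAVA_HOME:-}" \
    || echo "Fix the variables listed above before building."
```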
Get and Build Hadoop from Trunk
Hadoop code is available through Git so that users and developers can easily pull down various versions. Here is the procedure for getting the latest and greatest code from trunk and building the Hadoop jars:
- Get the latest Hadoop trunk and place it in the directory named hadoop with this command:
git clone git://git.apache.org/hadoop-common.git hadoop
- Change into the hadoop directory and build all the Hadoop jars with Maven as follows:
cd hadoop
mvn clean install -DskipTests -Pdist
If all goes well you’ll have all the Hadoop jars and source so you can work on Hadoop internals or debug MapReduce applications.
When I wrote this article I pulled down hadoop-3.0.0-SNAPSHOT from trunk and placed the files in ${HOME}/apache/hadoop. After building in step 2, the Hadoop distribution is found in ${HOME}/apache/hadoop/hadoop-dist/target/hadoop/hadoop-3.0.0-SNAPSHOT on my system.
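Because the version in the directory name changes as trunk moves on, you may prefer to locate the built distribution rather than hard-code the path. Here is a sketch that searches under hadoop-dist/target for a directory whose name starts with hadoop- followed by a digit. `find_hadoop_dist` and `SRC_ROOT` are hypothetical names of my own, and the default source location is simply the one I used above.

```shell
#!/bin/sh
# Sketch: print the first distribution directory (for example
# hadoop-3.0.0-SNAPSHOT) found under <source root>/hadoop-dist/target.
# find_hadoop_dist is a hypothetical helper.
find_hadoop_dist() {
    target="$1/hadoop-dist/target"
    [ -d "$target" ] || return 1
    # -prune stops find from descending into a match once it is printed.
    find "$target" -type d -name 'hadoop-[0-9]*' -print -prune | head -n 1
}

# SRC_ROOT is a hypothetical variable; it defaults to the location
# used in this article.
dist=$(find_hadoop_dist "${SRC_ROOT:-$HOME/apache/hadoop}") || dist=""
if [ -n "$dist" ]; then
    # Ask the freshly built distribution to report its version.
    "$dist/bin/hadoop" version
else
    echo "No built distribution found; has the Maven build finished?"
fi
```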
Development IDEs
All you have left to do is get a development IDE. Most people use Eclipse for Java development, but that is not my favorite by a long shot. I much prefer IntelliJ: either the Community Edition, which is free, or the Ultimate Edition, which has various commercial licenses.
In a subsequent blog, I’ll show you how to use IntelliJ to debug Hadoop applications. Stay tuned…
Author: Vic Hargrave





