Create a Hadoop Build and Development Environment

One of the first things I had to do when I started working with Hadoop was fix bugs within the Hadoop stack. Working on Hadoop internals requires numerous programming tools and libraries.

If you have a desire or need to work on Hadoop code, I’ve summarized the packages you need to install and configure to create a Hadoop build and development environment.

Base Operating System

By far the easiest operating system to set up for Hadoop development is a Red Hat-derived distro. I highly recommend CentOS 6.x – I use CentOS 6.3 64 bit. To limit the scope of this article I'm going to assume you have a 64 bit CentOS system to work with, so I won't describe the installation procedure here.
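If you want to double check what you have before proceeding, both the distribution and the architecture are easy to verify from a shell:

cat /etc/redhat-release    # expect something like: CentOS release 6.3 (Final)
uname -m                   # expect: x86_64 on a 64 bit system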

Install Oracle JDK 1.6

CentOS normally comes with the OpenJDK Java environment. This is not the version of Java you want to use for Hadoop development. Instead you should install Oracle's official Java 1.6 JDK and remove OpenJDK. Note that you must run yum as root to install packages on your system.

  1. Remove OpenJDK.
    yum -y remove *jdk*
    yum -y remove *java*
  2. Get Oracle’s Java 1.6 JDK. I suggest downloading the rpm.bin version.
  3. Install JDK 1.6 by running the rpm.bin package, as shown below.
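For reference, here is roughly what that looks like from a shell, assuming the download is named jdk-6u45-linux-x64-rpm.bin (your exact file name will vary with the JDK update level):

chmod +x jdk-6u45-linux-x64-rpm.bin
sudo ./jdk-6u45-linux-x64-rpm.bin
java -version    # should now report the Oracle (HotSpot) JVM, not OpenJDK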

Install CentOS Packages

Install the following CentOS packages using the yum commands as shown. Note some of the packages may already be installed.

yum -y install gcc-c++.x86_64
yum -y install make.x86_64
yum -y install openssl.x86_64 openssl-devel.x86_64 openssh.x86_64
yum -y install libtool.x86_64
yum -y install autoconf.noarch automake.noarch
yum -y install cmake.x86_64
yum -y install xz.x86_64 xz-devel.x86_64 
yum -y install zlib.x86_64 zlib-devel.x86_64
yum -y install git.x86_64

Install Snappy Libraries

You need to get the snappy libraries from the RPMforge repository. Here is what you do to get the RPMforge repo file and snappy library:

  1. Download the RPMforge repo file from http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm.
  2. Install the repo file by typing:
    rpm -Uvh rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
  3. Use yum to get the snappy lib:
    yum -y install snappy.x86_64 snappy-devel.x86_64
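If you want to verify that both the repository and the snappy libraries installed cleanly, a quick rpm query does the trick:

rpm -q rpmforge-release
rpm -q snappy snappy-devel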

Install Protobuf

Protocol Buffers are used internally by Hadoop for RPC. Install the library as follows:

  1. Download protobuf-2.4.1.tar.gz.
  2. Unpack and build:
    tar zxvf protobuf-2.4.1.tar.gz
    cd protobuf-2.4.1
    ./configure
    make
    sudo make install
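One caveat: make install places the library under /usr/local/lib, which is not always on the runtime linker path. Refreshing the linker cache and checking the compiler version is a reasonable sanity check:

sudo ldconfig       # pick up the new library in /usr/local/lib
protoc --version    # should print: libprotoc 2.4.1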

Install Apache and Findbugs Tools

Last but by no means least, you'll need FindBugs and the Apache development tools: Maven, Ant and Ivy. The CentOS packages of the Apache tools are usually not what you want. That may change in the future, but in the meantime follow the instructions below to obtain and install the latest tools.

  1. Download Apache Maven.
  2. Download Apache Ant.
  3. Download Apache Ivy.
  4. Download Findbugs.
  5. Install each package as follows:
    tar zxvf <maven package>.tgz
    tar zxvf <ant package>.tgz
    tar zxvf <ivy package>.tgz
    tar zxvf <findbugs package>.tgz
    sudo cp -R <maven directory> /usr/local/apache_maven/
    sudo cp -R <ant directory> /usr/local/apache_ant/ 
    sudo cp -R <ivy directory> /usr/local/apache_ivy/
    sudo cp -R <findbugs> /usr/local/findbugs/
  6. Set your .bash_profile or .bashrc to include these environment variables:
    export FB_HOME=/usr/local/findbugs
    export ANT_HOME=/usr/local/apache_ant
    export IVY_HOME=/usr/local/apache_ivy
    export M2_HOME=/usr/local/apache_maven
    
    export JAVA_HOME=/usr/java/default
    
    PATH=$PATH:$M2_HOME/bin:$IVY_HOME/bin:$ANT_HOME/bin:$FB_HOME/bin
    
    export PATH
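With the profile sourced, each tool should now resolve from the updated PATH. A quick check:

source ~/.bash_profile
java -version
mvn -version
ant -version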

Get and Build Hadoop from Trunk

Hadoop code is maintained in an Apache Git repository so that users and developers can easily pull down various versions. Here is the procedure for getting the latest and greatest code from trunk and building the Hadoop jars:

  1. Get the latest Hadoop trunk and place it in the directory named hadoop with this command:
    git clone git://git.apache.org/hadoop-common.git hadoop
  2. Build all the Hadoop jars with Maven as follows (a native-build variant is shown after these steps):
    mvn clean install -DskipTests -Pdist
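The command in step 2 builds the Java artifacts only. If you also want the native libraries (the reason cmake, zlib and snappy were installed earlier), Hadoop's BUILDING.txt describes a native profile along these lines:

mvn package -Pdist,native -DskipTests -Dtar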

If all goes well you’ll have all the Hadoop jars and source so you can work on Hadoop internals or debug MapReduce applications.

When I wrote this article I pulled down hadoop-3.0.0-SNAPSHOT from trunk and placed the files in ${HOME}/apache/hadoop. After building in step 2, the Hadoop distribution is found in ${HOME}/apache/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT on my system.
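A quick smoke test of the freshly built distribution (using the paths from my system above) is to ask the hadoop launcher for its version:

cd ${HOME}/apache/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
bin/hadoop version    # should report Hadoop 3.0.0-SNAPSHOT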

Development IDEs

All you have left to do is get a development IDE. Most people use Eclipse for Java development, but that is not my favorite by a long shot. I much prefer IntelliJ IDEA, either the Community Edition, which is free, or the Ultimate Edition, which has various commercial licenses.

In a subsequent blog, I’ll show you how to use IntelliJ to debug Hadoop applications. Stay tuned…
