Create a Hadoop Build and Development Environment

One of the first things I had to do when I started working with Hadoop was fix bugs within the Hadoop stack. Working on Hadoop internals requires numerous programming tools and libraries.

If you have a desire or need to work on Hadoop code, I’ve summarized the packages you need to install and configure to create a Hadoop build and development environment.

Base Operating System

By far the easiest operating system to set up for Hadoop development is a Red Hat-derived distro. I highly recommend CentOS 6.x; I use 64-bit CentOS 6.3. To limit the scope of this article I'm going to assume you have a 64-bit CentOS system to work with, so I won't describe the installation procedure here.

Install Oracle JDK 1.6

CentOS normally comes with the OpenJDK Java environment. This is not the version of Java you want to use for Hadoop development. Instead you should install Oracle’s official Java 1.6 JDK and remove OpenJDK. Note you have to run yum as root to be able to install packages on your system.

  1. Remove OpenJDK.
    yum -y remove *jdk*
    yum -y remove *java*
  2. Get Oracle’s Java 1.6 JDK. I suggest downloading the rpm.bin version.
  3. Install JDK 1.6 by making the rpm.bin package executable and running it from a shell, as sketched below.
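
Here is roughly what that looks like. The filename below is just an example; substitute whatever JDK 1.6 update you actually downloaded:

    chmod +x jdk-6u45-linux-x64-rpm.bin   # make the self-extracting installer executable
    sudo ./jdk-6u45-linux-x64-rpm.bin     # unpacks and installs the bundled JDK rpm
    java -version                         # should now report the Oracle JDK, not OpenJDK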

Install CentOS Packages

Install the following CentOS packages using the yum commands as shown. Note some of the packages may already be installed.

yum -y install gcc-c++.x86_64
yum -y install make.x86_64
yum -y install openssl.x86_64 openssl-devel.x86_64 openssh.x86_64
yum -y install libtool.x86_64
yum -y install autoconf.noarch automake.noarch
yum -y install cmake.x86_64
yum -y install xz.x86_64 xz-devel.x86_64 
yum -y install zlib.x86_64 zlib-devel.x86_64
yum -y install git.x86_64
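
If you want to double-check that yum got everything, a quick sanity loop like this (purely optional) will flag any package that failed to install:

    for p in gcc-c++ make openssl-devel openssh libtool autoconf automake cmake xz-devel zlib-devel git; do
        rpm -q $p > /dev/null || echo "MISSING: $p"
    done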

Install Snappy Libraries

You need to get the snappy libraries from the RPMforge repository. Here is what you do to get the RPMforge repo file and snappy library:

  1. Download the RPMforge repo file from http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm.
  2. Install the repo file by typing:
    rpm -Uvh rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
  3. Use yum to get the snappy lib:
    yum -y install snappy.x86_64 snappy-devel.x86_64
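
To verify that both the snappy runtime and the headers the native Hadoop build needs are in place:

    rpm -q snappy snappy-devel            # both should report as installed
    rpm -ql snappy-devel | grep '\.h$'    # lists the installed snappy headers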

Install Protobuf

Protocol Buffers are used internally by Hadoop for RPC. Install it as follows:

  1. Download protobuf-2.4.1.tar.gz.
  2. Unpack and build:
    tar zxvf protobuf-2.4.1.tar.gz
    cd protobuf-2.4.1
    ./configure
    make
    sudo make install
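
One gotcha: ./configure installs the protobuf libraries under /usr/local/lib by default, which is not on the dynamic linker's search path on a stock CentOS 6 system. If a later build complains that it cannot find libprotobuf, register the path and confirm protoc is visible (the usr-local.conf filename below is arbitrary):

    echo '/usr/local/lib' | sudo tee /etc/ld.so.conf.d/usr-local.conf
    sudo ldconfig                         # refresh the shared-library cache
    protoc --version                      # should print: libprotoc 2.4.1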

Install Apache and Findbugs Tools

Last but by no means least, you'll need Findbugs and the Apache development tools: Maven, Ant and Ivy. The CentOS packages of the Apache tools are usually not what you want. That may change in the future, but in the meantime follow the steps below to obtain and install the latest tools.

  1. Download Apache Maven.
  2. Download Apache Ant.
  3. Download Apache Ivy.
  4. Download Findbugs.
  5. Install each package as follows:
    tar zxvf <maven package>.tgz
    tar zxvf <ant package>.tgz
    tar zxvf <ivy package>.tgz
    tar zxvf <findbugs package>.tgz
    sudo cp -R <maven directory> /usr/local/apache_maven/
    sudo cp -R <ant directory> /usr/local/apache_ant/ 
    sudo cp -R <ivy directory> /usr/local/apache_ivy/
    sudo cp -R <findbugs directory> /usr/local/findbugs/
  6. Set your .bash_profile or .bashrc to include these environment variables:
    export FB_HOME=/usr/local/findbugs
    export ANT_HOME=/usr/local/apache_ant
    export IVY_HOME=/usr/local/apache_ivy
    export M2_HOME=/usr/local/apache_maven
    
    export JAVA_HOME=/usr/java/default
    
    PATH=$PATH:$M2_HOME/bin:$IVY_HOME/bin:$ANT_HOME/bin:$FB_HOME/bin
    
    export PATH
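
After saving the profile, pull it into your current shell and make sure each tool resolves from its new location:

    source ~/.bash_profile
    mvn -version                          # should report Maven running from /usr/local/apache_maven
    ant -version
    which findbugs                        # should print /usr/local/findbugs/bin/findbugs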

Get and Build Hadoop from Trunk

Hadoop source is maintained in the Apache Git repository (with a GitHub mirror), so users and developers can easily pull down various versions. Here is the procedure for getting the latest and greatest code from trunk and building the Hadoop jars:

  1. Get the latest Hadoop trunk and place it in the directory named hadoop with this command:
    git clone git://git.apache.org/hadoop-common.git hadoop
  2. Build all the Hadoop jars with maven as follows:
    mvn clean install -DskipTests -Pdist
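
If you also want the native libraries (libhadoop with zlib and snappy support), Hadoop's BUILDING.txt documents a native profile you can add to the same command; this variant is optional:

    mvn clean install -DskipTests -Pdist,native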

If all goes well you’ll have all the Hadoop jars and source so you can work on Hadoop internals or debug MapReduce applications.

When I wrote this article I pulled down hadoop-3.0.0-SNAPSHOT from trunk and placed the files in ${HOME}/apache/hadoop. After building in step 2, the Hadoop distribution is found in ${HOME}/apache/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT on my system.
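
A quick way to confirm the build produced a usable distribution is to run the hadoop script straight out of that target directory:

    cd ${HOME}/apache/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
    bin/hadoop version                    # prints the version string for the freshly built jars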

Development IDEs

All you have left to do is get a development IDE. Most people use Eclipse for Java development, but it is not my favorite by a long shot. I much prefer IntelliJ IDEA: the Community Edition is free, and the Ultimate Edition is available under various commercial licenses.

In a subsequent blog, I’ll show you how to use IntelliJ to debug Hadoop applications. Stay tuned…

Comments on this post

  1. MKRao

    Hi Vic,
    Thanks! Very informative. It would have been great if you include the hardware requirements as well for this setup.
    Regards
    -MKRao

  2. Mike Leonard

    Hi Vic,

    Great Job. This was a huge help. I followed your instructions and built Hadoop for the first time.

    I did find one minor typo: based on how the directory structure was created in earlier steps, there should be an _ instead of a – in the directory names (i.e. apache_ant, apache_ivy, apache_maven)

    export ANT_HOME=/usr/local/apache-ant
    export IVY_HOME=/usr/local/apache-ivy
    export M2_HOME=/usr/local/apache-maven

    Thanks,
    Mike

    • vic

      Thanks for the correction Mike. I’m glad you found the article useful.

  3. Pete

    Thanks Vic, great article. I'm still struggling to actually install hadoop: compile/package etc. with maven all work fine, but I'm not aware of the steps that assume you have hadoop in /usr/local/hadoop/bin directories, for example. I'm trying manual copies to do this stuff, but not sure that's correct – I was hoping mvn deploy would do it, but not sure it does.

    • vic

      It doesn't matter where you install hadoop or JDK 1.6 as long as you follow the steps I've outlined. Make sure you have installed all the dependencies.

  4. Mike

    Maven 3.0.5 fails on compile at Apache Hadoop Common. I have verified my EXPORTs for ant, ivy, maven and findbugs are all good per instructions. Not sure what to look for in the POM file or my environment to get a clean compile. Any thoughts?

    • My only suggestion is to make sure you have all the dependencies. Here is a script you can use on CentOS to make sure you get it all:

      #!/bin/bash

      echo "Installing basic dev RPMs…"
      yum -y install gcc-c++.x86_64
      yum -y install make.x86_64
      yum -y install openssl.x86_64 openssl-devel.x86_64 openssh.x86_64
      yum -y install libtool.x86_64
      yum -y install autoconf.noarch automake.noarch
      yum -y install cmake.x86_64
      yum -y install xz.x86_64 xz-devel.x86_64
      yum -y install zlib.x86_64 zlib-devel.x86_64
      yum -y install git.x86_64

      echo "Installing snappy-devel…"
      rpm -Uvh http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
      yum -y install snappy-devel.x86_64

      echo "Installing protobuf…"
      wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
      tar zxvf protobuf-2.5.0.tar.gz
      cd protobuf-2.5.0
      ./configure
      make
      make install

      echo "Downloading Apache tools…"
      wget http://www.interior-dsgn.com/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
      wget http://supergsego.com/apache/ant/binaries/apache-ant-1.9.4-bin.tar.gz
      wget http://www.interior-dsgn.com/apache/ant/ivy/2.3.0/apache-ivy-2.3.0-bin.tar.gz

      echo "Installing Apache tools…"
      tar zxf apache-maven-3.0.5-bin.tar.gz
      tar zxf apache-ant-1.9.4-bin.tar.gz
      tar zxf apache-ivy-2.3.0-bin.tar.gz

      cp -R apache-maven-3.0.5 /usr/local/maven
      cp -R apache-ant-1.9.4 /usr/local/ant
      cp -R apache-ivy-2.3.0 /usr/local/ivy

      Actually now that I look at this script, it doesn’t have the findbugs part, but that won’t matter because I used this to set up many systems. I hope this helps.

      • Mike

        Got Maven 3.2.3, Java 7 and Hadoop 2.5.0 working. I downloaded the tarballs for protobuf, snappy and zlib and built those on my target system using their ./configure scripts.

        For Hadoop 2.5.0:
        I executed the following command before any other Maven compile/package/etc.
        mvn install -DskipTests *

        * This is essential, it builds all of the plugins you will need to do a compile and package. You must have JAVA_HOME, PROTOC, SNAPPY, ZLIB all installed as you point out.

        You then issue:
        mvn compile -Pnative -DskipTests
        mvn package -Pdist,native,docs -Dtar

        And from there update your site config files and start up Hadoop (I am using a Single Node in my dev env).

        I found your website to be a great starting point that showed the key steps needed. I also used the Apache Build Instructions for Hadoop (build.txt).

        Many thanks Vic

  5. iceberrg

    Firstly, thank you so much for this tutorial. Could you please change the following code:

    yum -y install snappy.x86-64 snappy-devel.x86_64

    to this:

    yum -y install snappy.x86_64 snappy-devel.x86_64

    I think the hyphen should have been an underscore!

    • Yes good catch. I’ve fixed that line in the article. Thanks for catching this error.

  6. sp

    hello, i have built the source code using the above steps!! Thanks. But what if i need to modify the source code of hadoop and test it?

  7. hello sir.
    the steps are really helpful. can we change hadoop's JobInProgress.java file and get to know where the jar files for this will be created?
