Debugging Hadoop Applications with IntelliJ

In my last blog post, I explained how to create and configure a Hadoop development environment so that you can build the jars and example applications from the Hadoop source code in the Apache Hadoop trunk repository.

This time around I’ll show you how to debug your Hadoop applications using the IntelliJ Community Edition IDE. I’m going to discuss two different projects: one to debug the PI estimation program from the Hadoop examples jar file, and the other to debug the WordCount application.

Hadoop PI Estimation Example

Open the Hadoop IntelliJ Project

I’m going to assume that you already have IntelliJ and all the Hadoop development tools you need.

  1. cd into your hadoop directory.
  2. Type the following command to create the Hadoop IntelliJ projects.
    mvn idea:idea
  3. Open IntelliJ.
  4. Select Open… from the top menu bar.
  5. Browse to the hadoop-main.ipr file in your hadoop directory.
  6. Open hadoop-main.ipr.

Create Run and Debug Configuration

Now that you have the Hadoop project, you are going to create a run and debugging configuration for the Hadoop MapReduce example programs. In this case we’ll be debugging the Hadoop PI estimation program.

  1. Select Run > Edit Configurations…  from the top menu bar.
  2. Click on the ‘+‘ symbol in the upper left hand corner of the Run/Debug Configurations screen.
  3. Select Application in the drop down menu.
  4. Enter standalone as the configuration name.
  5. Enter org.apache.hadoop.util.RunJar as the main class.
  6. Enter the location of your hadoop directory as the working directory. In my case it is /home/vic/apache/hadoop. To simplify the nomenclature, I’ll refer to this directory as ${HADOOP} for the remainder of the blog.
  7. The Hadoop trunk version that I pulled down is hadoop-3.0.0-SNAPSHOT, so the Hadoop examples jar is located at:
    ${HADOOP}/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar

    Click on the program arguments button then enter the path of the Hadoop examples jar and the PI estimation arguments as shown below.

Run-Debug Configuration

  1. Click on the Close button.
  2. Click on the OK button in the Run/Debug Configurations screen.
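The program arguments field takes the path of the examples jar first, followed by the name of the example and its own parameters (for pi, the number of maps and the number of samples per map). Here is a small plain-Java sketch that assembles that argument line. The class and method names are made up for illustration, and the sample count of 10 is an assumption on my part; the 10 maps match the ten "Wrote input for Map" lines in the run output further down.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds the "Program arguments" line for the standalone configuration.
// PiProgramArguments and build() are hypothetical names, not part of Hadoop.
class PiProgramArguments {

    // hadoopDir is the ${HADOOP} directory; version is the build version string.
    static List<String> build(String hadoopDir, String version, int nMaps, int nSamples) {
        String examplesJar = hadoopDir
                + "/hadoop-dist/target/hadoop-" + version
                + "/share/hadoop/mapreduce/hadoop-mapreduce-examples-" + version + ".jar";
        List<String> args = new ArrayList<>();
        args.add(examplesJar);                 // RunJar expects the jar file first
        args.add("pi");                        // name of the example program to run
        args.add(Integer.toString(nMaps));     // number of map tasks
        args.add(Integer.toString(nSamples));  // samples per map (assumed value below)
        return args;
    }

    public static void main(String[] argv) {
        // 10 maps matches the run output; 10 samples per map is an assumption.
        List<String> args = build("/home/vic/apache/hadoop", "3.0.0-SNAPSHOT", 10, 10);
        System.out.println(String.join(" ", args));
    }
}
```

Paste the printed line into the program arguments field, adjusting the directory and version to match your own build.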

Debug Hadoop PI Estimation

With the Run/Debug configuration in place, you can either run the PI estimation program straight away or step through it in the IntelliJ debugger. Let’s do some debugging first.

  1. Open the main PI estimation file QuasiMonteCarlo.java from the following location:
    ${HADOOP}/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/QuasiMonteCarlo.java
  2. Next click to the left of the code window on each line where you want to break. Each breakpoint line will have a red circle next to it as shown below.

Break points

  1. To start debugging click on the bug icon in the toolbar at the top of the IntelliJ window.
  2. You’ll see a blue bar at each line where execution breaks, and a debug window will open at the bottom of the IntelliJ window. You can use the debugging controls to the right of the Console tab to step through the code.

Debugging

To run the PI estimation straight through you can click on the green triangle in the toolbar at the top of the IntelliJ window. If you run the program with the arguments entered into the standalone configuration earlier, the output will look like this:

Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 1.714 seconds
Estimated value of Pi is 3.20000000000000000000

Process finished with exit code 0
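The estimate above is coarse because the run uses only a small number of sample points. If you want a feel for what you are stepping through before you set breakpoints, here is a minimal plain-Java sketch of a quasi-Monte Carlo Pi estimate. It is my own simplified illustration, not the Hadoop source: it assumes a Halton sequence with bases 2 and 3 and does everything in one loop rather than in map and reduce tasks.

```java
// Sketch of a quasi-Monte Carlo Pi estimate using a Halton sequence.
// HaltonPiSketch is a hypothetical name; this is not the Hadoop implementation.
class HaltonPiSketch {

    // Returns the index-th element (index >= 1) of the Halton sequence for a base.
    static double halton(long index, int base) {
        double result = 0.0;
        double fraction = 1.0 / base;
        while (index > 0) {
            result += fraction * (index % base);
            index /= base;
            fraction /= base;
        }
        return result;
    }

    // Counts how many points of the unit square fall inside the inscribed circle
    // (radius 0.5, centered at 0.5, 0.5) and scales the ratio by 4.
    static double estimatePi(long numPoints) {
        long inside = 0;
        for (long i = 1; i <= numPoints; i++) {
            double x = halton(i, 2) - 0.5;
            double y = halton(i, 3) - 0.5;
            if (x * x + y * y <= 0.25) {
                inside++;
            }
        }
        return 4.0 * inside / numPoints;
    }

    public static void main(String[] args) {
        for (long n : new long[] {100L, 10_000L, 1_000_000L}) {
            System.out.println(n + " points: " + estimatePi(n));
        }
    }
}
```

Running it with increasing point counts shows the estimate settling toward Pi, which is the same effect you get by raising the map and sample arguments in the run configuration.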

WordCount Example

Create a WordCount IntelliJ Project

The process for debugging the WordCount 1.0 example from the Hadoop MapReduce Tutorial is similar to the Hadoop PI Estimation, except this time we have to create an IntelliJ project from scratch.

  1. Create this directory for your WordCount app:
    ${HOME}/WordCount
  2. Open IntelliJ.
  3. Select New Project… from the top menu bar.
  4. Select Java Module in the New Project screen.
  5. Set the project name to WordCount.
  6. Click Next then OK.

New Project

  1. Right click on the WordCount/src folder in the Project explorer.
  2. Select New > Java Class.
  3. Enter the class name as WordCount.
  4. Click on OK.
  5. Copy the WordCount 1.0 source code from the tutorial and paste it into your WordCount.java file.
  6. Select File > Save.
  7. Select File > Project Structure…
  8. Select Modules in the Project Structure screen.

New Project Structure

  1. Click on ‘+’ in the Dependencies tab.
  2. Go to this directory in the Hadoop distribution:
    ${HADOOP}/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/
  3. Select the subdirectories as shown below:

Add Hadoop jars

  1. Click OK.
  2. Click on ‘+‘ in the Dependencies tab again.
  3. Select this directory in the Hadoop share directory:
    ${HADOOP}/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/lib
  4. Click OK. Your module structure should look like this:

Project Structure
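If the dependency directories turn out to be empty or missing, the distribution build probably did not finish. A quick way to check is to list the jars under the share/hadoop directory. The following is a plain-Java sketch for that; the class name is hypothetical and you pass the directory as the only argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch only: lists every .jar file beneath a directory such as share/hadoop.
class ListHadoopJars {

    // Walks the tree under root and returns all regular files ending in .jar, sorted.
    static List<Path> findJars(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ListHadoopJars <share/hadoop directory>");
            return;
        }
        Path root = Paths.get(args[0]);
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory (was the distribution built?): " + root);
            return;
        }
        findJars(root).forEach(System.out::println);
    }
}
```

An empty listing means there is nothing for IntelliJ to add as a dependency yet, so go back and fix the Maven build first.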

  1. Still in the Project Structure screen, select Artifacts.
  2. Click on the ‘+‘ at the top of the screen.

Artifacts

  1. Select Add > Jar > Empty from the drop down menu.
  2. Set the artifact name to WordCount.
  3. Set the output directory to:
    ${HOME}/WordCount

WordCount Artifacts

  1. Click on the ‘+’ in the Output Layout tab.
  2. Select Module Output.
  3. Select WordCount in the Choose Module screen.

Choose Module

  1. Click OK. The artifacts should look like this:

WordCount Module Output

  1. Click on OK in the Project Structure screen.
  2. Build the WordCount jar by selecting Build > Build Artifacts… 
  3. Select WordCount > Rebuild.

Create Run and Debug Configuration

Follow the steps for creating a run and debug configuration discussed previously, but change the following settings:

  1. Program arguments should be:
    ${HOME}/WordCount/WordCount.jar input/ output/
  2. Working directory is:
    ${HOME}/WordCount
  3. Set the classpath to WordCount.
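One thing to watch when you run or debug more than once: as far as I know, Hadoop's file output formats refuse to write into an output directory that already exists, so a second run with the same output/ argument is likely to fail until that directory is removed. Here is a plain-Java sketch for clearing it between runs. The class name is hypothetical, and the directory to delete is passed as an argument so nothing is removed by accident.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch only: removes a previous job output directory so the job can be re-run.
class CleanOutputDir {

    // Deletes dir and everything beneath it. Returns false if dir did not exist.
    static boolean deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> toDelete;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order puts children ahead of their parent directories.
            toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: CleanOutputDir <output directory>");
            return;
        }
        Path output = Paths.get(args[0]);
        System.out.println(deleteRecursively(output)
                ? "Deleted " + output
                : "Nothing to delete at " + output);
    }
}
```

With the working directory set to ${HOME}/WordCount, the directory to pass is ${HOME}/WordCount/output.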

Create the Input Files

Finally you have to set up the input directory and files.

  1. Create this directory for the WordCount input files:
    ${HOME}/WordCount/input
  2. Create a text file in this directory called file001.
  3. Put these words in file001: Hello World Bye World.
  4. Create a text file in the same location called file002.
  5. Put these words in file002: Hello Hadoop Goodbye Hadoop.
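If you would rather script these steps than create the files by hand, the following plain-Java sketch does the same thing. The class and method names are hypothetical; it assumes the WordCount directory lives under your home directory as described earlier.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: creates the input directory and the two sample files.
class CreateWordCountInput {

    // Creates <wordCountDir>/input with file001 and file002 inside it.
    static void createInput(Path wordCountDir) throws IOException {
        Path input = wordCountDir.resolve("input");
        Files.createDirectories(input);
        Files.write(input.resolve("file001"),
                "Hello World Bye World\n".getBytes(StandardCharsets.UTF_8));
        Files.write(input.resolve("file002"),
                "Hello Hadoop Goodbye Hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Assumes the project directory is ${HOME}/WordCount.
        Path wordCountDir = Paths.get(System.getProperty("user.home"), "WordCount");
        createInput(wordCountDir);
        System.out.println("Created input files under " + wordCountDir.resolve("input"));
    }
}
```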

Debug WordCount

Now you are ready to run or debug WordCount. If you run the program, you will get a WordCount/output directory with a file called _SUCCESS and a results file named part-00000 that contains the following:

Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2
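When you step through the mapper and reducer in the debugger, it helps to know what result to expect at each stage. The plain-Java sketch below is my own simplification, not the tutorial code: it splits each line into tokens the way a mapper would emit words, then sums the counts per word the way a reducer would, keeping the words in sorted order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Sketch only: the word-count logic without any Hadoop classes.
class WordCountSketch {

    // Tokenizes each line on whitespace and tallies every word in sorted order.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same contents as file001 and file002.
        List<String> lines = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the two input files it prints the same five word counts as the part-00000 listing above, so any difference you see while debugging the real job points at the job setup rather than the counting logic.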

9 thoughts on “Debugging Hadoop Applications with IntelliJ”

  2. Hi,

    Thanks for this post. The ‘pi’ example worked.
    I’m working with hadoop-2.3.0-src.
    However, I’m missing dependencies in the WordCount example.

    No SNAPSHOT or share directories were created:
    ${HADOOP_HOME}/hadoop-dist/target/hadoop-2.3.0-SNAPSHOT/share/hadoop/

    There are some build issues:
    $ mvn clean install

    [INFO] Reactor Summary:
    [INFO]
    [INFO] Apache Hadoop Main ………………………….. SUCCESS [ 4.918 s]
    [INFO] Apache Hadoop Project POM ……………………. SUCCESS [ 2.098 s]
    [INFO] Apache Hadoop Annotations ……………………. SUCCESS [ 7.012 s]
    [INFO] Apache Hadoop Project Dist POM ……………….. SUCCESS [ 1.717 s]
    [INFO] Apache Hadoop Assemblies …………………….. SUCCESS [ 0.647 s]
    [INFO] Apache Hadoop Maven Plugins ………………….. SUCCESS [ 7.964 s]
    [INFO] Apache Hadoop MiniKDC ……………………….. SUCCESS [ 44.000 s]
    [INFO] Apache Hadoop Auth ………………………….. SUCCESS [02:51 min]
    [INFO] Apache Hadoop Auth Examples ………………….. FAILURE [ 2.233 s]
    [INFO] Apache Hadoop Common ………………………… SKIPPED
    [INFO] Apache Hadoop NFS …………………………… SKIPPED
    [INFO] Apache Hadoop Common Project …………………. SKIPPED
    [INFO] Apache Hadoop HDFS ………………………….. SKIPPED
    [INFO] Apache Hadoop HttpFS ………………………… SKIPPED
    [INFO] Apache Hadoop HDFS BookKeeper Journal …………. SKIPPED
    [INFO] Apache Hadoop HDFS-NFS ………………………. SKIPPED
    [INFO] Apache Hadoop HDFS Project …………………… SKIPPED
    [INFO] hadoop-yarn ………………………………… SKIPPED
    [INFO] hadoop-yarn-api …………………………….. SKIPPED
    [INFO] hadoop-yarn-common ………………………….. SKIPPED
    [INFO] hadoop-yarn-server ………………………….. SKIPPED
    [INFO] hadoop-yarn-server-common ……………………. SKIPPED
    [INFO] hadoop-yarn-server-nodemanager ……………….. SKIPPED
    [INFO] hadoop-yarn-server-web-proxy …………………. SKIPPED
    [INFO] hadoop-yarn-server-resourcemanager ……………. SKIPPED
    [INFO] hadoop-yarn-server-tests …………………….. SKIPPED
    [INFO] hadoop-yarn-client ………………………….. SKIPPED
    [INFO] hadoop-yarn-applications …………………….. SKIPPED
    [INFO] hadoop-yarn-applications-distributedshell ……… SKIPPED
    [INFO] hadoop-yarn-applications-unmanaged-am-launcher …. SKIPPED
    [INFO] hadoop-yarn-site ……………………………. SKIPPED
    [INFO] hadoop-yarn-project …………………………. SKIPPED
    [INFO] hadoop-mapreduce-client ……………………… SKIPPED
    [INFO] hadoop-mapreduce-client-core …………………. SKIPPED
    [INFO] hadoop-mapreduce-client-common ……………….. SKIPPED
    [INFO] hadoop-mapreduce-client-shuffle ………………. SKIPPED
    [INFO] hadoop-mapreduce-client-app ………………….. SKIPPED
    [INFO] hadoop-mapreduce-client-hs …………………… SKIPPED
    [INFO] hadoop-mapreduce-client-jobclient …………….. SKIPPED
    [INFO] hadoop-mapreduce-client-hs-plugins ……………. SKIPPED
    [INFO] Apache Hadoop MapReduce Examples ……………… SKIPPED
    [INFO] hadoop-mapreduce ……………………………. SKIPPED
    [INFO] Apache Hadoop MapReduce Streaming …………….. SKIPPED
    [INFO] Apache Hadoop Distributed Copy ……………….. SKIPPED
    [INFO] Apache Hadoop Archives ………………………. SKIPPED
    [INFO] Apache Hadoop Rumen …………………………. SKIPPED
    [INFO] Apache Hadoop Gridmix ……………………….. SKIPPED
    [INFO] Apache Hadoop Data Join ……………………… SKIPPED
    [INFO] Apache Hadoop Extras ………………………… SKIPPED
    [INFO] Apache Hadoop Pipes …………………………. SKIPPED
    [INFO] Apache Hadoop OpenStack support ………………. SKIPPED
    [INFO] Apache Hadoop Client ………………………… SKIPPED
    [INFO] Apache Hadoop Mini-Cluster …………………… SKIPPED
    [INFO] Apache Hadoop Scheduler Load Simulator ………… SKIPPED
    [INFO] Apache Hadoop Tools Dist …………………….. SKIPPED
    [INFO] Apache Hadoop Tools …………………………. SKIPPED
    [INFO] Apache Hadoop Distribution …………………… SKIPPED
    [INFO] ————————————————————————
    [INFO] BUILD FAILURE
    [INFO] ————————————————————————
    [INFO] Total time: 04:12 min
    [INFO] Finished at: 2014-08-19T16:24:19-07:00
    [INFO] Final Memory: 41M/87M
    [INFO] ————————————————————————
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-install-plugin:2.3.1:install (default-install) on project hadoop-auth-examples: Failed to install artifact org.apache.hadoop:hadoop-auth-examples:war:2.3.0: /Users/davidlaxer/.m2/repository/org/apache/hadoop/hadoop-auth-examples/2.3.0/hadoop-auth-examples-2.3.0.war (Permission denied) -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
    [ERROR]
    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR] mvn -rf :hadoop-auth-examples

    Any ideas?

    • Sorry for taking so long to get back to you. I just pulled down the trunk and checked out the 0.2.3.0 branch and everything built fine for me. I’m thinking you downloaded and expanded the hadoop-common repo with one set of privileges – possibly ‘root’ – and tried to build it with another.
