In this post I want to improve the Hadoop job I created earlier: wiki-pagerank-with-hadoop
The goals:
- Update to the new stable Hadoop version.
- Use the new API.
- Set up a single-node Hadoop cluster.
- Easily build a job-jar with Maven.
I used version 0.20.204.0 for the initial Hadoop jobs. The Hadoop committers have been busy in the meantime; the latest stable version at the moment is 2.2.0.
1. Update Hadoop Version
Updating to the latest version is just a matter of changing the version in the pom.xml; Maven will do the rest for you, as long as the version is available in the repository. In my pom I have configured the ibiblio mirror, which contains the latest version of Hadoop. The client classes needed to build a job are now in a separate dependency.
...
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
</dependency>
...
Let's download the new version:
mvn clean compile
2. Use the new API
In the IDE I noticed that the classes from org.apache.hadoop.mapred.* are no longer deprecated. In the old version some of these classes were marked as deprecated, because at that time a new API was available. The old API has been un-deprecated because it is here to stay. Note: both the old and the new API will work.
This is an overview of the most important changes:
What | Old | New |
---|---|---|
classes location | org.apache.hadoop.mapred | org.apache.hadoop.mapreduce |
conf class | JobConf | Configuration |
map method | map(k1, v1, OutputCollector<k2, v2>, Reporter) | map(k1, v1, Context) |
reduce method | reduce(k2, Iterator<v2>, OutputCollector<k3, v3>, Reporter) | reduce(k2, Iterable<v2>, Context) |
reduce iteration | while (values.hasNext()) | for (v2 value : values) |
map/reduce signature | throws IOException | throws IOException, InterruptedException |
before map/reduce | configure(JobConf job) | setup(Context ctx) |
after map/reduce | close() | cleanup(Context ctx) |
client class | JobConf / JobClient | Job |
run job | JobClient.runJob(job) | job.waitForCompletion(true) |
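To make the reduce-side rows of the table concrete, here is a minimal sketch of a new-API reducer, assuming plain Text keys and values; the class name and the string concatenation are only illustrative, this is not the actual PageRank reducer from the project.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal new-API reducer sketch; placeholder names, not the project's real PageRank reducer.
public class ExampleReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once before the first reduce() call; replaces configure(JobConf) from the old API.
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // New API: iterate with a for-each loop instead of while (values.hasNext()).
        StringBuilder combined = new StringBuilder();
        for (Text value : values) {
            combined.append(value.toString());
        }
        context.write(key, new Text(combined.toString()));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once after the last reduce() call; replaces close() from the old API.
    }
}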
Migrating to the new API was not that difficult. Delete all the imports from org.apache.hadoop.mapred.* and re-import from the correct package, then fix the errors one by one. You can check out the new code and the changes I made to support the new API on GitHub. The functionality is unchanged.
Old API:
public class RankCalculateMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ...
        output.collect(new Text(page), new Text("|" + links));
    }
}
New API:
public class RankCalculateMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        ...
        context.write(new Text(page), new Text("|" + links));
    }
}
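The driver changes in the same way: JobConf and JobClient are replaced by the single Job class. Below is a minimal sketch of what a new-API driver could look like; the class name RankCalculateDriver, the job name and the argument handling are assumptions for illustration, the real entry point in the project is com.xebia.sandbox.hadoop.WikiPageRanking.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal new-API driver sketch; class name, job name and paths are illustrative assumptions.
public class RankCalculateDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // New API: a single Job object replaces JobConf/JobClient.
        Job job = Job.getInstance(conf, "wiki-pagerank-sketch");
        job.setJarByClass(RankCalculateDriver.class);

        job.setMapperClass(RankCalculateMapper.class);
        // Reducer, combiner and input/output formats omitted for brevity.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // New API: job.waitForCompletion(true) replaces JobClient.runJob(job).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}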
3. Set up a single-node Hadoop cluster
In the previous post I used an Eclipse plugin to run the main class directly on the Hadoop cluster. Under the hood, the plugin builds the job-jar and sends it to the cluster. Since I have switched IDEs I can't use the plugin anymore, and I want an independent build so the job can be created and run without Eclipse. Installing your own single-node cluster is very easy these days.
Install a virtual machine (VirtualBox or VMware Player) and download a quickstart distribution from your favorite vendor:
- Cloudera: http://go.cloudera.com/vm-download
- Hortonworks: http://hortonworks.com/products/hortonworks-sandbox/
After starting the virtual machine, a 'cluster' is available in pseudo-distributed mode! From your own machine, outside the virtual machine, you should be able to access the HUE page at http://localhost:8888.
To move files between host and virtual-machine there are multiple options.
- Configure a shared folder.
- Use the HUE web interface to upload files.
- Use secure copy over ssh.
I went for option 3: scp. For that option I need to access the virtual machine over ssh on port 22.
My VirtualBox is configured with NAT networking and port forwarding. I added a port-forwarding rule from host port 2222 to guest port 22.
Now I can copy files from my machine to the virtual machine using:
scp -P 2222 data_subset.xml cloudera@localhost:~
On the virtual machine we put the data in the correct place in HDFS:
hadoop fs -mkdir wiki
hadoop fs -mkdir wiki/in
hadoop fs -put data_subset.xml /user/cloudera/wiki/in/
4. Easily build a job-jar with Maven
In the original setup the code and Eclipse were tightly coupled. To build a standalone jar I used the maven-assembly-plugin. Make sure you do not include hadoop-common and hadoop-client in your jar by marking these dependencies with scope provided. Dependencies not marked with scope test/provided, like commons-io, will be included in your jar.
...
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>2.4</version>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
...
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
...
Now you can create your job:
mvn assembly:assembly
[INFO] --- maven-assembly-plugin:2.4:assembly (default-cli) @ hadoop-wiki-pageranking ---
[INFO] Building jar: /Users/abij/projects/github/hadoop-wiki-pageranking/target/hadoop-wiki-pageranking-0.2-SNAPSHOT-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.590s
[INFO] Finished at: Fri Jan 31 13:25:18 CET 2014
[INFO] Final Memory: 15M/313M
[INFO] ------------------------------------------------------------------------
Now copy the freshly baked job-jar to the cluster and run it.
Local machine:
scp -P 2222 hadoop-wiki-pageranking-*-dependencies.jar cloudera@localhost:~
Virtual machine:
hadoop jar hadoop-wiki-*.jar com.xebia.sandbox.hadoop.WikiPageRanking
Watch the progress in HUE: http://localhost:8888/jobbrowser/
Recap
In this blog we updated the vanilla Hadoop MapReduce job from the old API to the new API. Updating was not hard, but it touches all the classes. We used Maven to build a job-jar from the code and ran it on a cluster. The new source code is available on my GitHub.