Friday, May 10, 2013

How MapReduce Works - Stock Quote Example


A MapReduce program has two phases: a Map phase and a Reduce phase. Between them, a Shuffle step sorts and groups the map output.

  • Map task
  • Shuffle - sort & group
  • Reduce Task



Stock Quote Example

Download historical data for any listed company from Google Finance.
For example, Google Finance provides historical data for Google stock for the last year in CSV format.
The CSV file contains fields like Date, Opening Price, High, Low Price, and Closing Price for each day.
Sample data below:



Now let's use Hadoop to find the highest stock price for a given year.

Mapper - This function extracts the required fields (month and closing price) from the above data and emits key-value pairs as shown below:


May, 873.88
May, 863.87
May, 861.85
May, 846.8
May, 834.55
May, 824.72
April, 783.75
April, 779.55
April, 786.99
April, 805.75
April, 814.2
April, 814.83
April, 802.25


The Shuffle step sorts the key-value pairs by key and groups the values for each month. The input to the Reducer function then looks like:

(May, [ 873.88, 863.87, 861.85, 846.8, 834.55, 824.72 ] )
(April, [ 783.75, 779.55, 786.99, 805.75, 814.2, 814.83, 802.25 ] )

The Reducer function then loops through each list of values and emits the highest price per key. The highest price for the year is:
(May , 873.88)

Data flow:
input data > Mapper function > Shuffle(sort/ group) > Reducer function > output file
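The data flow above can be sketched in plain Python (no Hadoop required). This is only an illustration of the Map > Shuffle > Reduce steps, using the sample month/price pairs from this post; the function names are made up for the sketch:

```python
from collections import defaultdict

# Mapper output from the post: (month, closing price) pairs
mapped = [
    ("May", 873.88), ("May", 863.87), ("May", 861.85),
    ("May", 846.8), ("May", 834.55), ("May", 824.72),
    ("April", 783.75), ("April", 779.55), ("April", 786.99),
    ("April", 805.75), ("April", 814.2), ("April", 814.83),
    ("April", 802.25),
]

def shuffle(pairs):
    """Group values by key, mimicking Hadoop's sort/group step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_max(groups):
    """Reducer: emit the highest price for each key (month)."""
    return {month: max(prices) for month, prices in groups.items()}

monthly_high = reduce_max(shuffle(mapped))
year_high = max(monthly_high.items(), key=lambda kv: kv[1])
print(year_high)   # ('May', 873.88)
```

In real Hadoop, the shuffle and group step happens in the framework between the Mapper and Reducer tasks; only the map and reduce logic is written by the user.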


Wednesday, May 8, 2013

How to install hadoop on Linux - Local (Standalone) Mode

 Hadoop can be configured in three different modes:

  1. Local (Standalone) Mode
  2. Pseudo-Distributed Mode
  3. Fully-Distributed Mode

This blog explains Local (Standalone) Mode, which is the easiest to configure. If you just want to get started and run a MapReduce job, this is the mode to try.

Install sun jdk

Download JDK 6 (latest update) from http://www.oracle.com/technetwork/java/javase/overview/index.html
I downloaded update 37 from the URL below:
http://www.oracle.com/technetwork/java/javase/downloads/jdk6u37-downloads-1859587.html

cd /scratch/rajiv/hadoop
./jdk-6u37-linux-x64.bin
The JDK will be installed under the same folder (jdk1.6.0_37).


Download latest stable hadoop version 

Download hadoop-1.0.4-bin.tar.gz from apache mirror http://www.motorlogy.com/apache/hadoop/common/stable/

Extract hadoop:
 tar -xvf hadoop-1.0.4-bin.tar.gz

Hadoop is now extracted under /scratch/rajiv/hadoop/hadoop-1.0.4


set JAVA_HOME for hadoop

 vi hadoop-1.0.4/conf/hadoop-env.sh

Uncomment the line below and update the path to JDK 1.6:

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

I have updated it to:
 export JAVA_HOME=/scratch/rajiv/hadoop/jdk1.6.0_37

Run sample MapReduce program

hadoop-1.0.4-bin.tar.gz ships with sample programs packaged in hadoop-examples-1.0.4.jar. Let's try Grep.java from this jar.

The source code of this class is not present in the Hadoop distribution, but it can be viewed through Hadoop's version control system; the Hadoop SVN repository provides an option to browse source code online.


To view the source code of Hadoop version 1.0.4, go to branch-1.0, click on Grep.java, and view the latest revision.

This job has three steps:
  • Mapper - Mapper class is set to RegexMapper
  • Combiner - this is set to LongSumReducer
  • Reducer - Reducer class is set to LongSumReducer

Job configuration - prepares job parameters such as the input and output folders, and specifies the Mapper and Reducer classes.

Job client - used to submit the job.
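As a rough sketch in plain Python (not the actual Hadoop Java classes), the three steps of the grep job behave as follows: RegexMapper emits (matched string, 1) for every regex hit, and LongSumReducer sums the counts per key, serving as both combiner and reducer. The function names below are illustrative only:

```python
import re
from collections import defaultdict

def regex_mapper(line, pattern):
    """Like RegexMapper: emit (matched_string, 1) for each regex match in a line."""
    return [(match, 1) for match in re.findall(pattern, line)]

def long_sum_reducer(key, values):
    """Like LongSumReducer: sum the counts for a key (also usable as a combiner)."""
    return key, sum(values)

# Toy input standing in for the lines of index.html
lines = ["Apache Hadoop is an Apache project", "Welcome to Apache"]

pairs = [kv for line in lines for kv in regex_mapper(line, r"Apache")]
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
counts = dict(long_sum_reducer(k, v) for k, v in grouped.items())
print(counts)   # {'Apache': 3}
```

Using the same class as combiner and reducer works here because summing is associative: partial sums computed on each mapper's output can themselves be summed by the reducer.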



$cd /scratch/rajiv/hadoop/hadoop-1.0.4
$mkdir inputfiles
$cd inputfiles
$wget http://hadoop.apache.org/index.html
$cd ..
$./bin/hadoop jar hadoop-examples-*.jar grep inputfiles outputfiles 'Apache'

The wget command downloads the index.html file into the inputfiles folder. The hadoop command above reads index.html from the inputfiles folder, greps for occurrences of the string 'Apache', and writes each matched string and its count to the outputfiles folder.


Now examine the output folder

ls outputfiles/
_SUCCESS  part-00000

The file part-00000 contains the output of the MapReduce job.


cat outputfiles/part-00000
46      Apache



In this mode, Hadoop runs as a single Java process.
While the above command is running, run "ps -ef | grep RunJar" from another terminal; it shows a Java process that invokes:

"org.apache.hadoop.util.RunJar hadoop-examples-1.0.4.jar grep inputfiles"

The source code of org.apache.hadoop.util.RunJar can also be browsed in the Hadoop SVN repository.

The RunJar class basically loads Grep.class from hadoop-examples-1.0.4.jar and executes its main method.

Note that in Standalone mode the HDFS file system is not configured and the MapReduce program runs as a single Java process. To see Hadoop's distributed behavior in action, you need to configure Pseudo-Distributed Mode or Fully-Distributed Mode.