Friday, 3 March 2017

Coffee with Word Count Program (MapReduce)

In the present digital era, the amount of data grows day by day. Trillions of records are generated worldwide in a single day, which is indeed a huge amount. To analyze such huge data we turn to Hadoop, a framework for processing big data using concepts such as MapReduce, Hive and Pig. Today I am going to introduce you to MapReduce, a programming paradigm for analyzing big data in parallel. A MapReduce program is built around three classes:



Map Class:-
The Map class is responsible for reading the input data and splitting it into key-value pairs using the InputFormat class. The Mapper is declared with four data types: input key, input value, output key and output value. The Map class generates intermediate output, which goes to the Reduce phase as reducer input. This intermediate output is stored in the local file system of Linux, not in HDFS.

Reduce Class:-
The Reduce class is responsible for reading the intermediate output (the mapper output) as its input. The Reduce phase is the final step; it produces the final result, which is stored inside HDFS.

Driver Class:-
The Driver class is responsible for configuring the job. It specifies the input path and output path of the files, the input file format and the output file format, and the Mapper and Reducer classes to use.


Conversion of Java data types into Hadoop wrapper classes

In Java, we are familiar with the int, float, String and double data types, but in Hadoop they are converted into Hadoop wrapper classes like these:

Java data types                                                              Hadoop Wrapper Classes
int                                                                                      IntWritable
float                                                                                   FloatWritable
String                                                                                 Text
double                                                                               DoubleWritable
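To see the conversion in practice, here is a tiny stand-alone sketch, separate from the word count job (the class name WritableDemo is just for illustration), showing how plain Java values are wrapped into Hadoop Writable types and unwrapped again:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Wrap plain Java values into Hadoop Writable types
        IntWritable count = new IntWritable(1);   // int    -> IntWritable
        Text word = new Text("Hadoop");           // String -> Text

        // Unwrap them back into plain Java types with get() / toString()
        int plainCount = count.get();
        String plainWord = word.toString();

        System.out.println(plainWord + "," + plainCount); // prints Hadoop,1
    }
}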

For the working of MapReduce and related terms, please refer to Map Reduce.


Word Count Program using Map Reduce 

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

// Mapper Class
// Mapper<K1,V1,K2,V2>: K1,V1 = input to the Mapper, K2,V2 = output from the Mapper.
// Example input file line: "Hadoop is used by Data Engineer Data Engineer"
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    // map() consists of four parameters:
    // first  - the key (the byte offset of the line within the input file),
    // second - the value (the line of text itself),
    // third  - the OutputCollector, which collects the mapper output; its type parameters
    //          <Text, IntWritable> must be the same as the mapper output types K2,V2,
    // fourth - the Reporter, used for getting the progress and debugging of the program.
    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // set() puts the next token into the Text object that is emitted as the key
            value.set(tokenizer.nextToken());
            output.collect(value, new IntWritable(1)); // (key, value) = (word, 1)
        }
    }
}

Intermediate Output
Hadoop,1
is,1
used,1
by,1
Data,1
Engineer,1
Data,1
Engineer,1

After the sorting and shuffling phase, the output would be
Hadoop,[1]
is,[1]
used,[1]
by,[1]
Data,[1,1]
Engineer,[1,1]
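The framework performs this grouping itself, but as a rough analogy (plain Java, not Hadoop code; the class name ShuffleSketch is just for illustration), here is a small sketch of what the sorting and shuffling phase does: all the 1s emitted for the same key are sorted by key and collected into one list per key:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Pairs emitted by the mapper, in the order they were emitted (each with value 1)
        String[] mapperOutput = {"Hadoop", "is", "used", "by", "Data", "Engineer", "Data", "Engineer"};

        // TreeMap keeps the keys sorted; each key maps to the list of its values
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapperOutput) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }

        // Prints e.g. {Data=[1, 1], Engineer=[1, 1], Hadoop=[1], ...}
        System.out.println(grouped);
    }
}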

// Reducer Class
// Reducer<K2,V2,K3,V3>: the output types from the Mapper (K2,V2) must be the same
// as the input types to the Reducer.
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        // Add up all the 1s collected for this key
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
// Driver Class
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class); // should be the main class of the job
    conf.setJobName("wordcount");
    conf.setMapperClass(Map.class);     // must match the Mapper class declared above
    conf.setReducerClass(Reduce.class); // must match the Reducer class declared above
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // TextInputFormat and TextOutputFormat are the defaults in Hadoop, so these two
    // lines can be omitted when the job uses the default input and output formats.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // path of the input file
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // directory where the Reducer output will be stored

    // If the output directory already exists inside HDFS, the job throws an exception
    // saying the output directory already exists, so delete it first.
    Path outputPath = new Path(args[1]);
    outputPath.getFileSystem(conf).delete(outputPath, true);

    JobClient.runJob(conf);
    }
}

Final Output
Hadoop,1
is,1 
used,1
by,1
Data,2
Engineer,2


We can overcome this problem by directly deleting the output file directory, either through the Hadoop shell using

hadoop fs -rmr /path to your output/

or through the Java API inside the program:
outputPath.getFileSystem(conf).delete(outputPath, true);

GitHub link to the full WordCount example: WordCount

Steps to submit the job in MR

1. The input file should be present inside HDFS.
2. Make a jar file of the program using the Eclipse or IntelliJ IDEA editor.
3. Submit the job to Hadoop through the command mentioned below:
hadoop jar jarname.jar programname /input file name with extension /output file directory name/
** jarname.jar, programname, the input file name and the output file directory name should be changed according to your own jar, program, input file and output directory.


I hope this post will be beneficial for everyone who has taken their first baby steps in Hadoop.
