In the present digital era, the amount of data grows day by day; trillions of records are generated worldwide every single day. Indeed, it is a huge amount. To analyze data at this scale we turn to Hadoop, a framework for processing big data with concepts such as MapReduce, Hive, Pig, etc. Today I am going to introduce you to MapReduce, a programming paradigm for analyzing big data. A MapReduce program is built around three classes: -
Map Class: -
The Map class is responsible for reading the input data and splitting it into key-value pairs with the help of the InputFormat class. A Mapper declares four type parameters: input key, input value, output key and output value. The Map class generates intermediate output, which goes to the Reducer phase as the Reducer input. This intermediate output is stored on the local Linux file system, not in HDFS.
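To see what the map step produces, here is a minimal, Hadoop-free sketch (the class name MapStepSketch and the hard-coded sample line are only illustrations, not part of the WordCount program shown later): it splits one input line into (word, 1) pairs the same way the mapper will.

import java.util.StringTokenizer;

public class MapStepSketch {
    public static void main(String[] args) {
        // One line of the input file, as the mapper would receive it.
        String line = "Hadoop is used by Data Engineer Data Engineer";
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Every word is emitted as an intermediate (key, value) pair with value 1.
            System.out.println(tokenizer.nextToken() + ", 1");
        }
    }
}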
Reduce Class:-
The Reducer class reads the intermediate output coming from the Mapper as its input. The Reducer is the final step: it produces the final result, which is stored inside HDFS.
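Again as a minimal, Hadoop-free sketch (the class name and sample values are illustrative only), this is what the reduce step does for a single key: iterate over the list of 1s collected for that key and sum them.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceStepSketch {
    public static void main(String[] args) {
        // After shuffling, a key such as "Data" arrives with the value list [1, 1].
        List<Integer> values = Arrays.asList(1, 1);
        int sum = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        // The reducer emits the final (key, count) pair.
        System.out.println("Data, " + sum);
    }
}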
Driver Class:-
The Driver class is responsible for configuring the job. It sets the input and output file paths, the input file format and the output file format, and wires the Mapper and Reducer classes together.
Conversion of Java data types into Hadoop wrapper classes
In Java we are familiar with int, float, String and double, but in Hadoop these are replaced by serializable wrapper classes: -
Java data type      Hadoop wrapper class
int                 IntWritable
float               FloatWritable
String              Text
double              DoubleWritable
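A quick sketch of how these wrappers are used (plain Java, just for illustration; it only assumes the Hadoop client library is on the classpath): values are wrapped with a constructor or set(), and the plain Java value is read back with get(), or toString() for Text.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableSketch {
    public static void main(String[] args) {
        // Wrap plain Java values in Hadoop's serializable types.
        Text word = new Text("Hadoop");
        IntWritable count = new IntWritable(1);

        // Read the plain Java values back out.
        String javaString = word.toString();
        int javaInt = count.get();

        // Writables are mutable, so the same object can be reused.
        count.set(javaInt + 1);
        System.out.println(javaString + ", " + count.get());
    }
}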
For the working of MapReduce and related terms, please refer to Map Reduce.
Word Count Program using Map Reduce
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // Mapper<K1,V1,K2,V2>: K1,V1 = input to the Mapper, K2,V2 = output from the Mapper.
    // Input file = "Hadoop is used by Data Engineer Data Engineer"
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // map() takes four parameters:
        // - the input key (the byte offset of the line),
        // - the input value (the line itself),
        // - the OutputCollector, parameterized with the Mapper's output types K2,V2
        //   (here Text and IntWritable),
        // - the Reporter, used for reporting progress and debugging the program.
        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // set() places the next token into the reusable Text object.
                value.set(tokenizer.nextToken());
                // Emit the (word, 1) key-value pair.
                output.collect(value, new IntWritable(1));
            }
        }
    }
Intermediate Output
Hadoop,1
is,1
used,1
by,1
Data,1
Engineer,1
Data,1
Engineer,1
After the sorting and shuffling phase, the keys are grouped and sorted, so the output would be
Data,[1,1]
Engineer,[1,1]
Hadoop,1
by,1
is,1
used,1
    // Reducer class
    // Reducer<K2,V2,K3,V3>: the output types from the Mapper must be the same as the
    // input types to the Reducer, i.e. K2,V2 in the Mapper = K2,V2 in the Reducer.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            // Add up the 1s collected for this key.
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Emit the final (word, count) pair.
            output.collect(key, new IntWritable(sum));
        }
    }
    // Driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class); // pass the job's main class (here WordCount)
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);     // must be the Mapper class declared above
        conf.setReducerClass(Reduce.class); // must be the Reducer class declared above
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputFormat(TextInputFormat.class);   **
        conf.setOutputFormat(TextOutputFormat.class); **

** The default input format in Hadoop is TextInputFormat and the default output format is TextOutputFormat. If the program uses the default input and output formats, these two lines can be left out of the driver class.

        // Input path of the input file
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // Output directory path where the Reducer output will be stored
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        Path outputPath = new Path(args[1]); **
        outputPath.getFileSystem(conf).delete(outputPath, true); **

** If the output directory already exists inside HDFS, the job throws an exception saying that the output directory already exists; this line removes it before the job runs.

        JobClient.runJob(conf);
    }
}
Final Output
Data,2
Engineer,2
Hadoop,1
by,1
is,1
used,1
We can overcome this problem by deleting the output directory beforehand, either from the Hadoop shell:
hadoop fs -rmr /path/to/your/output/
or through the Java API inside the program:
outputPath.getFileSystem(conf).delete(outputPath, true);
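A slightly safer variant (a sketch, assuming the same conf and outputPath variables as in the driver above, plus an import of org.apache.hadoop.fs.FileSystem) checks whether the directory exists before deleting it:

// requires: import org.apache.hadoop.fs.FileSystem;
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    // Recursively remove the old output directory so the job can be run again.
    fs.delete(outputPath, true);
}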
GitHub link to the full WordCount example: WordCount
Steps to submit the job in MR
1. The input file should be present inside HDFS.
2. Build a jar file of the program using an editor such as Eclipse or IntelliJ IDEA.
3. Submit the job to Hadoop with the command below:
hadoop jar jarname.jar programname /input-file-name-with-extension /output-file-directory-name/
** The bold words change according to your jar name, input file name and output file directory name.
I hope this post will be beneficial for everyone who has taken their first baby steps in Hadoop.