In the present digital era, the amount of data grows day by day; trillions of records are generated worldwide every single day. Indeed, it is a huge amount. To analyze data at this scale we turn to Hadoop, a framework for processing big data with concepts such as MapReduce, Hive, Pig, etc. Today I am going to introduce you to MapReduce, a programming paradigm for analyzing big data. A MapReduce program is built around three classes: -
Map Class: -
The Map class is responsible for reading the input data and splitting it into key-value pairs with the help of the InputFormat class. A Mapper declares four type parameters: input key, input value, output key and output value. The Map class generates intermediate output, which goes to the Reducer phase as the Reducer input. This intermediate output is stored on the local Linux file system, not in HDFS.
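To see what the map step produces, here is a minimal, Hadoop-free sketch (the class name MapStepSketch and the hard-coded sample line are only illustrations, not part of the WordCount program shown later): it splits one input line into (word, 1) pairs the same way the mapper will.

import java.util.StringTokenizer;

public class MapStepSketch {
    public static void main(String[] args) {
        // One line of the input file, as the mapper would receive it.
        String line = "Hadoop is used by Data Engineer Data Engineer";
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Every word is emitted as an intermediate (key, value) pair with value 1.
            System.out.println(tokenizer.nextToken() + ", 1");
        }
    }
}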
Reduce Class:-
The Reducer class reads the intermediate output coming from the Mapper as its input. The Reducer is the final step: it produces the final result, which is stored inside HDFS.
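Again as a minimal, Hadoop-free sketch (the class name and sample values are illustrative only), this is what the reduce step does for a single key: iterate over the list of 1s collected for that key and sum them.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceStepSketch {
    public static void main(String[] args) {
        // After shuffling, a key such as "Data" arrives with the value list [1, 1].
        List<Integer> values = Arrays.asList(1, 1);
        int sum = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            sum += it.next();
        }
        // The reducer emits the final (key, count) pair.
        System.out.println("Data, " + sum);
    }
}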
Driver Class:-
The Driver class is responsible for configuring the job. It sets the input and output file paths, the input file format and the output file format, and wires the Mapper and Reducer classes together.
Conversion of Java data types into Hadoop wrapper classes
In Java we are familiar with int, float, String and double, but in Hadoop these are replaced by serializable wrapper classes: -
Java data type      Hadoop wrapper class
int                 IntWritable
float               FloatWritable
String              Text
double              DoubleWritable
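A quick sketch of how these wrappers are used (plain Java, just for illustration; it only assumes the Hadoop client library is on the classpath): values are wrapped with a constructor or set(), and the plain Java value is read back with get(), or toString() for Text.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableSketch {
    public static void main(String[] args) {
        // Wrap plain Java values in Hadoop's serializable types.
        Text word = new Text("Hadoop");
        IntWritable count = new IntWritable(1);

        // Read the plain Java values back out.
        String javaString = word.toString();
        int javaInt = count.get();

        // Writables are mutable, so the same object can be reused.
        count.set(javaInt + 1);
        System.out.println(javaString + ", " + count.get());
    }
}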
For the working of MapReduce and related terms, please refer to Map Reduce.
Word Count Program using Map Reduce
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // Mapper<K1,V1,K2,V2>: K1,V1 = input to the Mapper, K2,V2 = output from the Mapper.
    // Input file = "Hadoop is used by Data Engineer Data Engineer"
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // map() takes four parameters:
        // - the input key (the byte offset of the line),
        // - the input value (the line itself),
        // - the OutputCollector, parameterized with the Mapper's output types K2,V2
        //   (here Text and IntWritable),
        // - the Reporter, used for reporting progress and debugging the program.
        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // set() places the next token into the reusable Text object.
                value.set(tokenizer.nextToken());
                // Emit the (word, 1) key-value pair.
                output.collect(value, new IntWritable(1));
            }
        }
    }
Intermediate Output
Hadoop,1
is,1
used,1
by,1
Data,1
Engineer,1
Data,1
Engineer,1
After the sorting and shuffling phase, the keys are grouped and sorted, so the output would be
Data,[1,1]
Engineer,[1,1]
Hadoop,1
by,1
is,1
used,1
    // Reducer class
    // Reducer<K2,V2,K3,V3>: the output types from the Mapper must be the same as the
    // input types to the Reducer, i.e. K2,V2 in the Mapper = K2,V2 in the Reducer.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            // Add up the 1s collected for this key.
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Emit the final (word, count) pair.
            output.collect(key, new IntWritable(sum));
        }
    }
    // Driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class); // pass the job's main class (here WordCount)
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);     // must be the Mapper class declared above
        conf.setReducerClass(Reduce.class); // must be the Reducer class declared above
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputFormat(TextInputFormat.class);   **
        conf.setOutputFormat(TextOutputFormat.class); **

** The default input format in Hadoop is TextInputFormat and the default output format is TextOutputFormat. If the program uses the default input and output formats, these two lines can be left out of the driver class.

        // Input path of the input file
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // Output directory path where the Reducer output will be stored
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        Path outputPath = new Path(args[1]); **
        outputPath.getFileSystem(conf).delete(outputPath, true); **

** If the output directory already exists inside HDFS, the job throws an exception saying that the output directory already exists; this line removes it before the job runs.

        JobClient.runJob(conf);
    }
}
Final Output
Data,2
Engineer,2
Hadoop,1
by,1
is,1
used,1
We can overcome this problem by deleting the output directory beforehand, either from the Hadoop shell:
hadoop fs -rmr /path/to/your/output/
or through the Java API inside the program:
outputPath.getFileSystem(conf).delete(outputPath, true);
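A slightly safer variant (a sketch, assuming the same conf and outputPath variables as in the driver above, plus an import of org.apache.hadoop.fs.FileSystem) checks whether the directory exists before deleting it:

// requires: import org.apache.hadoop.fs.FileSystem;
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    // Recursively remove the old output directory so the job can be run again.
    fs.delete(outputPath, true);
}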
GitHub link to the full WordCount example: WordCount
Steps to submit the job in MR
1. The input file should be present inside HDFS.
2. Build a jar file of the program using an editor such as Eclipse or IntelliJ IDEA.
3. Submit the job to Hadoop with the command below:
hadoop jar jarname.jar programname /input-file-name-with-extension /output-file-directory-name/
** The bold words change according to your jar name, input file name and output file directory name.
I hope this post will be beneficial for everyone who has taken their first baby steps in Hadoop.