Hadoop has now reached version 1.0, but most MapReduce tutorials on the Internet still use the old MapReduce API, so I decided to study the new API.
The first step is to prepare the input data file for MapReduce, as follows:
1900,35.3
1900,33.2
....
1905,38.2
1905,37.1
As shown above, the temperature values of each year are recorded, and we now want the highest temperature of each year, which is a typical problem that MapReduce handles well. In the map stage the records are grouped by year, producing [1900, (35.3, 33.2, ...)], ..., [1905, (38.2, 37.1, ...)], and the reduce stage then computes the maximum temperature of each year (35.3 for 1900 and 38.2 for 1905 in the sample above).
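Before turning to MapReduce, it may help to see the same computation as plain single-machine Java. The following is only an illustrative sketch (the class name MaxTptrLocal is made up, and it assumes the data sits in a local file named temperature.txt, as used later); it is not part of the Hadoop job below.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Single-machine equivalent of the MapReduce job: read "year,temperature"
// lines and keep the maximum temperature seen for each year.
public class MaxTptrLocal {
    public static void main(String[] args) throws Exception {
        Map<String, Double> maxByYear = new HashMap<String, Double>();
        BufferedReader reader = new BufferedReader(new FileReader("temperature.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] items = line.split(",");          // e.g. "1900,35.3"
            double tptr = Double.parseDouble(items[1].trim());
            Double current = maxByYear.get(items[0]);
            if (current == null || tptr > current) {
                maxByYear.put(items[0], tptr);
            }
        }
        reader.close();
        for (Map.Entry<String, Double> e : maxByYear.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}

MapReduce performs the same grouping and aggregation, but distributed across the cluster: the map stage does the line splitting, and the reduce stage does the per-year maximum.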
The next step is to write the MapReduce class, which is similar to the old API; note, however, that the package referenced here is org.apache.hadoop.mapreduce.* instead of the original org.apache.hadoop.mapred.*. The code is as follows:
package com.bjcic.hadoop.guide;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTptr {

    public static class MaxTptrMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like "1900,35.3": emit (year, temperature)
            String[] items = value.toString().split(",");
            context.write(new Text(items[0]),
                    new DoubleWritable(Double.parseDouble(items[1])));
        }
    }

    public static class MaxTptrReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keep the largest temperature seen for this year
            double maxTptr = Double.MIN_VALUE;
            for (DoubleWritable val : values) {
                maxTptr = Math.max(val.get(), maxTptr);
            }
            context.write(key, new DoubleWritable(maxTptr));
        }
    }

    public static void main(String[] argv) {
        // JobConf conf = new JobConf(MaxTptr.class);  // old API, no longer needed
        Job job = null;
        try {
            job = new Job();
            job.setJarByClass(MaxTptr.class);
            FileInputFormat.addInputPath(job, new Path("input"));
            FileOutputFormat.setOutputPath(job, new Path("output"));
            job.setMapperClass(MaxTptrMapper.class);
            job.setReducerClass(MaxTptrReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
For the above code, the following points are worth noting:
The base classes that are extended have changed; they are now Mapper and Reducer.
An important Context class has been introduced.
Jobs are submitted and run through the new Job class instead of the old JobConf.
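For comparison, the driver written against the old org.apache.hadoop.mapred API looked roughly like the sketch below. This is only a contrast example reconstructed for illustration (the class name MaxTptrOldApi is made up), not part of the job above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Old-API driver, shown only for contrast with the new-API main() above:
// configuration goes through JobConf and the job is submitted with
// JobClient.runJob() instead of Job.waitForCompletion().
public class MaxTptrOldApi {
    public static void main(String[] argv) throws Exception {
        JobConf conf = new JobConf(MaxTptrOldApi.class);
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        // The Mapper and Reducer would also have to implement the old
        // org.apache.hadoop.mapred interfaces (extending MapReduceBase).
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DoubleWritable.class);
        JobClient.runJob(conf);
    }
}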
Using a pseudo-distributed deployment, upload the source file and the src directory to the Hadoop machine. The Hadoop job must first be compiled and packaged:
javac -classpath $HADOOP_HOME/hadoop-core-1.0.1.jar -d classes src/com/bjcic/hadoop/guide/MaxTptr.java
jar -cvf ./maxtptr.jar -C classes/ .
maxtptr.jar will be generated in the current directory.
Go to the $HADOOP_HOME directory and make sure Hadoop is running in pseudo-distributed mode; if it is not, start it with bin/start-all.sh (note that if startup fails, you can first run bin/hadoop namenode -format).
Upload the temperature file to Hadoop's HDFS file system:
$> bin/hadoop dfs -put /data_dir/temperature.txt input
Run the Hadoop job:
$> bin/hadoop jar maxtptr.jar com.bjcic.hadoop.guide.MaxTptr input output
Fetch the output from HDFS to the local filesystem:
$> bin/hadoop dfs -get output /data_dir/
At this point, an output directory will be created under data_dir, and the generated result file can be found in that directory.
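As a small sanity check, the fetched result can be printed with a few lines of plain Java. This is only a sketch (the class name PrintResult is made up), and it assumes the local path used above together with the usual default reducer output file name part-r-00000.

import java.io.BufferedReader;
import java.io.FileReader;

// Print the fetched result file; for the sample data it should show the
// maximum temperature per year (35.3 for 1900, 38.2 for 1905).
public class PrintResult {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(
                new FileReader("/data_dir/output/part-r-00000"));  // assumed path and default file name
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}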