Analysis of Hadoop MapReduce Programs

Abstract: A Hadoop MapReduce program consists of three parts: the Mapper, the Reducer, and job execution. This article describes and analyzes each of the three.

Keywords: MapReduce, Mapper, Reducer, job execution

A MapReduce program consists of three parts: the Mapper, the Reducer, and job execution.

Mapper

To serve as a Mapper, a class extends MapReduceBase and implements the Mapper interface.

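The Mapper interface is responsible for the data-processing stage. It uses Java generics in the form Mapper<K1,V1,K2,V2>, where the key classes implement the WritableComparable interface and the value classes implement the Writable interface. The interface has a single map() method, which processes one key/value pair at a time. Its signature is as follows.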

public void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException

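or, in the newer org.apache.hadoop.mapreduce API, where Mapper is a class to extend rather than an interface to implement:

public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException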

This function processes a given key/value pair (K1, V1) and produces a list of key/value pairs (K2, V2), which may be empty.
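
The contract just described, one (K1, V1) pair in and a possibly empty list of (K2, V2) pairs out, can be illustrated without a cluster. The following is a minimal sketch in plain Java, not Hadoop API: the class name MapContractSketch and the word-splitting task are assumptions chosen for illustration, with K1 taken to be a line's byte offset and V1 the line text.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the map() contract; hypothetical names, not Hadoop API. */
class MapContractSketch {

    /**
     * Models map(K1, V1) with K1 = byte offset of a line and V1 = the line text.
     * Returns the emitted (K2, V2) pairs: one (word, 1) pair per token.
     */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {              // a blank line emits nothing
                emitted.add(new SimpleEntry<>(token, 1));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be"));   // six pairs
        System.out.println(map(19L, "   "));                 // empty list
    }
}
```

In a real Mapper the pairs would be passed to output.collect() (or context.write() in the newer API) instead of being returned, and the types would be Writable classes such as Text and IntWritable.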

Hadoop ships with several useful Mapper implementations, including IdentityMapper, InverseMapper, RegexMapper, and TokenCountMapper.

Reducer

To serve as a Reducer, a class extends MapReduceBase and implements the Reducer interface.

The Reducer interface has a single reduce() method, with the following signature.

public void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException

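or, in the newer org.apache.hadoop.mapreduce API, where the values arrive as an Iterable rather than an Iterator:

public void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException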

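When a Reducer task receives the output of the various Mappers, it sorts the incoming data by key and merges the values that share the same key. It then calls reduce(), which iterates over the values associated with a given key and produces a (possibly empty) list of <K3, V3> pairs.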
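
The sort, merge, and reduce steps described above can be sketched in plain Java. This is a single-process model under stated assumptions, not Hadoop API: ReduceSketch, groupByKey, pair, and run are hypothetical names, and summing the values is only one possible reduce() body.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of sort, merge, and reduce; hypothetical names, not Hadoop API. */
class ReduceSketch {

    /** Convenience constructor for a (key, value) pair. */
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    /** Sorts by key and merges the values of equal keys (TreeMap keeps keys ordered). */
    static TreeMap<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapOutput) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Models reduce(K2, Iterator<V2>): iterates over one key's values and emits their sum. */
    static List<Map.Entry<String, Integer>> reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        emitted.add(pair(key, sum));
        return emitted;
    }

    /** Calls reduce() once per distinct key, in key order. */
    static List<Map.Entry<String, Integer>> run(List<Map.Entry<String, Integer>> mapOutput) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(mapOutput).entrySet()) {
            out.addAll(reduce(group.getKey(), group.getValue().iterator()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(pair("to", 1));
        mapOutput.add(pair("be", 1));
        mapOutput.add(pair("to", 1));
        System.out.println(run(mapOutput));
    }
}
```

In Hadoop itself the grouping happens across machines during the shuffle, so only the reduce() body is written by the programmer.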

Hadoop also provides some useful Reducer implementations, including IdentityReducer and LongSumReducer.

Job Execution

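In the run() method, a MapReduce job is launched by passing a configured job to JobClient.runJob(). Within run(), the basic parameters must be set for each job, including the input path, the output path, the Mapper class, and the Reducer class.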
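
To see how these parameters fit together, it may help to model a job run locally. The sketch below is plain Java and only imitates, in one process, what the framework does with a configured job: it takes input records, a mapper, and a reducer, and returns the output lines. LocalJobSketch, MapFn, ReduceFn, runLocal, and the citation-inverting task are all hypothetical names and choices made for illustration; none of them is Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Single-process model of a job run; hypothetical names, not Hadoop API. */
class LocalJobSketch {

    interface MapFn {
        void map(String key, String value, BiConsumer<String, String> output);
    }

    interface ReduceFn {
        void reduce(String key, Iterator<String> values, BiConsumer<String, String> output);
    }

    /** Plays the role of the configured job: input records plus the two user classes. */
    static List<String> runLocal(List<String[]> input, MapFn mapper, ReduceFn reducer) {
        // Map phase: intermediate pairs are collected, grouped, and sorted by key.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] record : input) {
            mapper.map(record[0], record[1],
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one reduce() call per key; output is tab-separated lines.
        List<String> outputLines = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            reducer.reduce(group.getKey(), group.getValue().iterator(),
                    (k, v) -> outputLines.add(k + "\t" + v));
        }
        return outputLines;
    }

    /** Example job: turn (citing, cited) records into "cited <TAB> citing,citing,..." lines. */
    static List<String> invertCitations(List<String[]> input) {
        MapFn invert = (key, value, out) -> out.accept(value, key);
        ReduceFn join = (key, values, out) -> {
            StringBuilder joined = new StringBuilder();
            while (values.hasNext()) {
                if (joined.length() > 0) {
                    joined.append(",");
                }
                joined.append(values.next());
            }
            out.accept(key, joined.toString());
        };
        return runLocal(input, invert, join);
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
                new String[] {"p3", "p1"}, new String[] {"p2", "p1"}, new String[] {"p3", "p2"});
        invertCitations(input).forEach(System.out::println);
    }
}
```

In a real job the input and output are paths on the file system rather than in-memory lists, and the order of values within one key is not guaranteed as it is in this model.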

The basic skeleton of a typical MapReduce program is as follows.

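import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;

public class MyJob extends Configured implements Tool {

    /* The Mapper of the MapReduce program */
    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Add the Mapper's processing code here
        }
    }

    /* The Reducer of the MapReduce program */
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            // Add the Reducer's processing code here
        }
    }

    /* Job execution for the MapReduce program */
    public int run(String[] args) throws Exception {
        // Add the job execution code here
        return 0;
    }
}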