
Learning Hadoop: A First Look at Map/Reduce and a Small Demo


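I. Concepts

Hadoop MapReduce is a distributed computing framework for processing massive data sets. The framework takes care of hard problems such as distributed data storage, job scheduling, fault tolerance, and communication between machines, so engineers with no experience in parallel or distributed computing can still write structurally simple parallel programs that process large-scale data on hundreds or thousands of machines.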

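Hadoop MapReduce is based on the idea of "divide and conquer". It abstracts a computation into two phases, map and reduce, which can be understood simply as "scatter the computation, then merge the results". A MapReduce program first splits the input data into a set of independent key/value pairs (key1/value1), which are processed in parallel by multiple map tasks. MapReduce then sorts the map output (a set of intermediate key/value pairs, key2/value2) by key2. The sort is ascending and compares the keys as byte arrays in memory, in the manner of memcmp. All value2 entries that belong to the same key2 are grouped together as the input of a reduce task, which computes the final result and emits key3/value3. As an optimization, key2/value2 pairs produced on the same compute node can be merged locally by a combine step. The basic flow is shown below: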

[Figure: basic MapReduce flow]
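To make the byte-wise key ordering concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name ByteOrderDemo and the helpers compareBytes and sortKeys are hypothetical names chosen for illustration. The sketch assumes the keys are UTF-8 strings and compares them as unsigned bytes, the way memcmp would, so it only approximates what the framework does for text keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Hadoop code: orders keys by their raw bytes.
class ByteOrderDemo {

    // memcmp-style comparison: the first differing unsigned byte decides,
    // and an array that is a prefix of the other one sorts first.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Returns a copy of the keys in ascending order of their UTF-8 bytes.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((k1, k2) -> compareBytes(
                k1.getBytes(StandardCharsets.UTF_8),
                k2.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("/jyglFront/a", "/jygl/b", "/", "/b", "/B");
        System.out.println(sortKeys(keys));
    }
}
```

Under this ordering an uppercase letter sorts before a lowercase one and "10" sorts before "9", because only byte values are compared. That is worth keeping in mind when reading the sorted output of a job.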


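Comparison of the computation flow in Hadoop and in a single-machine program:

[Figure: computation flow of a Hadoop program compared with a single-machine program]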

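The input and output of a computation are usually stored in files, and these files are kept in HDFS (Hadoop Distributed File System). The system tries to schedule each task on a node that already holds the data instead of moving the data to the compute node. This avoids transferring large volumes of data over the network and saves bandwidth.

Application developers generally only need to care about the gray parts of the figure. A single-machine program has to handle reading and writing the data as well as processing it. A Hadoop program only has to implement map and reduce, while reading and writing the data, transferring data between map and reduce, and fault tolerance are handled automatically by Hadoop MapReduce and HDFS.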

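II. Setting Up the Development Environment

Map/Reduce programs depend on a Hadoop cluster, and Eclipse needs the Hadoop plugin package it depends on.

Hadoop cluster setup: see "Hadoop 2.2.0 cluster setup" http://www.linuxidc.com/Linux/2014-06/102865.htm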


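1. Install and configure Eclipse

Download a suitable Eclipse from the official site, then copy the plugin jar that Hadoop development depends on into the plugins folder of the Eclipse installation. The jar for Hadoop 2.2.0 development can be downloaded from: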

FTP address: ftp://ftp1.linuxidc.com

Username: ftp1.linuxidc.com

Password: www.linuxidc.com

Path: 2014 年 LinuxIDC.com\6 月 \Hadoop 学习:Map&Reduce 初探与小 Demo 实现

For download instructions, see http://www.linuxidc.com/Linux/2013-10/91140.htm

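You can of course also build the plugin yourself.

Start Eclipse and choose Window > Preferences. If a Hadoop Map/Reduce entry appears as in the figure below, the plugin was installed successfully.

[Figure: Eclipse Preferences dialog showing the Hadoop Map/Reduce entry]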

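2. Configure DFS, which is mainly used to manage the input and output data files.

Choose Window > Open Perspective > Other > Map/Reduce to show the Map/Reduce view. Click the small elephant icon in Map/Reduce Locations to create a new Hadoop Location, and fill it in as follows:

[Figure: settings entered for the new Hadoop Location]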

A DFS Location entry then appears in the project view. It is used to manage the input and output data files.

[Figure: DFS Location entry in the project view]

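The Hadoop installation folder also has to be configured: when creating a new Map/Reduce project, click Configure Hadoop install directory and enter the Hadoop installation path.

[Figure: configuring the Hadoop installation directory]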

Right-click an empty folder under DFS Location and upload a text file, then refresh. If the file shows up, the environment is configured correctly.


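III. Programming Model

The MapReduce programming model works as follows: the computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions, Map and Reduce.

The user-defined Map function takes one input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values that share the same intermediate key I and passes them to the Reduce function.

The user-defined Reduce function accepts an intermediate key I and the set of values for that key. It merges these values into a possibly smaller set of values. Typically each Reduce invocation produces zero or one output value. The intermediate values are usually supplied to the Reduce function through an iterator, which makes it possible to handle value sets that are too large to fit in memory.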
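The model described above can be sketched in a single process. The following is only an illustration in plain Java, not the Hadoop API: the class MiniMapReduce and its methods run and wordCount are hypothetical names, and a TreeMap stands in for the framework's sort-and-group step. It assumes everything fits in memory, which a real job does not require.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical single-process model of the programming model; not the Hadoop API.
class MiniMapReduce {

    // Applies mapFn to every input record, groups the intermediate values by key
    // (the TreeMap keeps keys sorted), then calls reduceFn once per key.
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapFn,
            BiFunction<K, Iterable<V>, R> reduceFn) {
        Map<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapFn.apply(input)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<K, R> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reduceFn.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Example job: count how often each space-separated word occurs.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                line -> {
                    // Map: emit (word, 1) for every word in the line.
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : line.split(" ")) {
                        pairs.add(new SimpleEntry<>(word, 1));
                    }
                    return pairs;
                },
                (word, ones) -> {
                    // Reduce: add up all values seen for this word.
                    int sum = 0;
                    for (int one : ones) {
                        sum += one;
                    }
                    return sum;
                });
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

The reduce function receives its values as an Iterable, mirroring the iterator mentioned above: it never needs the whole value set as a list, only one element at a time.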

IV. A Small Example

1. Data preparation

The example uses a Tomcat access log. The log format is as follows:

127.0.0.1,-,-,[08/May/2014:13:42:40 +0800],GET / HTTP/1.1,200,11444
127.0.0.1,-,-,[08/May/2014:13:42:42 +0800],GET /jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurrentClassPlanVO HTTP/1.1,204,-
127.0.0.1,-,-,[08/May/2014:13:42:42 +0800],GET /jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurClassPlanVO HTTP/1.1,204,-
127.0.0.1,-,-,[08/May/2014:13:42:47 +0800],GET /jygl/jaxrs/right/isValidUserByType/1-admin-superadmin HTTP/1.1,200,20
127.0.0.1,-,-,[08/May/2014:13:42:47 +0800],GET /jygl/jaxrs/right/getUserByLoginName/admin HTTP/1.1,200,198
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/right_login2home?loginName=admin&password=superadmin&type=1 HTTP/1.1,200,2525
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/style/style.css HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/js/tree.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/js/jquery.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/js/frame.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/images/logo.png HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/images/leftmenu_bg.gif HTTP/1.1,404,1105
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/menuList.jsp HTTP/1.1,200,47603
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/style/images/header_bg.jpg HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/images/allmenu.gif HTTP/1.1,404,1093
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:47 +0800],GET /jyglFront/mainView/navigate/images/toggle_menu.gif HTTP/1.1,404,1105
127.0.0.1,-,-,[08/May/2014:13:42:48 +0800],GET /jygl/jaxrs/article/getArticleList/10-1 HTTP/1.1,200,20913
127.0.0.1,-,-,[08/May/2014:13:42:48 +0800],GET /jygl/jaxrs/article/getTotalArticleRecords HTTP/1.1,200,22
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:48 +0800],GET /jyglFront/baseInfo_articleList?flag=1 HTTP/1.1,200,8989
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:42:48 +0800],GET /jyglFront/mainView/studentView/style/images/nav_10.png HTTP/1.1,404,1117
127.0.0.1,-,-,[08/May/2014:13:43:21 +0800],GET /jygl/jaxrs/right/isValidUserByType/1-admin-superadmin HTTP/1.1,200,20
127.0.0.1,-,-,[08/May/2014:13:43:21 +0800],GET /jygl/jaxrs/right/getUserByLoginName/admin HTTP/1.1,200,198
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/right_login2home?loginName=admin&password=superadmin&type=1 HTTP/1.1,200,2525
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/js/tree.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/js/jquery.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/js/frame.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/style/style.css HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/menuList.jsp HTTP/1.1,200,47603
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/images/logo.png HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/images/leftmenu_bg.gif HTTP/1.1,404,1105
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/images/toggle_menu.gif HTTP/1.1,404,1105
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/style/images/header_bg.jpg HTTP/1.1,304,-
127.0.0.1,-,-,[08/May/2014:13:43:21 +0800],GET /jygl/jaxrs/article/getArticleList/10-1 HTTP/1.1,200,20913
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/navigate/images/allmenu.gif HTTP/1.1,404,1093
127.0.0.1,-,-,[08/May/2014:13:43:21 +0800],GET /jygl/jaxrs/article/getTotalArticleRecords HTTP/1.1,200,22
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/baseInfo_articleList?flag=1 HTTP/1.1,200,8989
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:21 +0800],GET /jyglFront/mainView/studentView/style/images/nav_10.png HTTP/1.1,404,1117
127.0.0.1,-,-,[08/May/2014:13:43:25 +0800],GET /jygl/jaxrs/graduate/graduateBatchService/getGraduateBatchByConditions?graduateBatchName=&pageSize=10&pageNo=1 HTTP/1.1,200,597
127.0.0.1,-,-,[08/May/2014:13:43:25 +0800],GET /jygl/jaxrs/graduate/graduateBatchService/getTotalGraduateBatchByCondition?graduateBatchName= HTTP/1.1,200,21
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:26 +0800],GET /jyglFront/graduate_initGraduateBatch HTTP/1.1,200,8766
127.0.0.1,-,-,[08/May/2014:13:43:27 +0800],GET /jygl/jaxrs/exam/examParameterService/getAllStudyCenters HTTP/1.1,200,29089
127.0.0.1,-,-,[08/May/2014:13:43:27 +0800],GET /jygl/jaxrs/exam/examParameterService/getAllGradeInfo HTTP/1.1,200,3785
127.0.0.1,-,-,[08/May/2014:13:43:27 +0800],GET /jygl/jaxrs/enroll/educationLevelService/allEducationLevels HTTP/1.1,200,227
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:28 +0800],GET /jyglFront/graduate_initGraduateQulifyCheck HTTP/1.1,200,26397
127.0.0.1,-,-,[08/May/2014:13:43:29 +0800],GET /jygl/jaxrs/exam/examParameterService/getAllStudyCenters HTTP/1.1,200,29089
127.0.0.1,-,-,[08/May/2014:13:43:29 +0800],GET /jygl/jaxrs/exam/examParameterService/getAllGradeInfo HTTP/1.1,200,3785
127.0.0.1,-,-,[08/May/2014:13:43:29 +0800],GET /jygl/jaxrs/enroll/educationLevelService/allEducationLevels HTTP/1.1,200,227
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:29 +0800],GET /jyglFront/graduate_initLeaveSchoolInfo HTTP/1.1,200,20125
127.0.0.1,-,-,[08/May/2014:13:43:30 +0800],GET /jygl/jaxrs/exam/examParameterService/getAllStudyCenters HTTP/1.1,200,29089
127.0.0.1,-,-,[08/May/2014:13:43:31 +0800],GET /jygl/jaxrs/exam/examParameterService/getAllGradeInfo HTTP/1.1,200,3785
127.0.0.1,-,-,[08/May/2014:13:43:31 +0800],GET /jygl/jaxrs/enroll/educationLevelService/allEducationLevels HTTP/1.1,200,227
127.0.0.1,-,-,[08/May/2014:13:43:31 +0800],GET /jygl/jaxrs/graduate/graduateBatchService/getAllGraduateBatch HTTP/1.1,200,597
0:0:0:0:0:0:0:1,-,-,[08/May/2014:13:43:31 +0800],GET /jyglFront/graduate_initGraduateInfo HTTP/1.1,200,28464
127.0.0.1,-,-,[08/May/2014:14:27:10 +0800],GET / HTTP/1.1,200,11444
127.0.0.1,-,-,[08/May/2014:14:27:12 +0800],GET /jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurrentClassPlanVO HTTP/1.1,204,-
127.0.0.1,-,-,[08/May/2014:14:27:12 +0800],GET /jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurClassPlanVO HTTP/1.1,204,-
127.0.0.1,-,-,[08/May/2014:14:27:34 +0800],GET /jygl/jaxrs/exam/examArrangeService/getExamBatchIdByLatest HTTP/1.1,200,43
127.0.0.1,-,-,[08/May/2014:14:27:34 +0800],GET /jygl/jaxrs/exam/examArrangeService/getExamBatchNameByEBId/4af2a0424323412e014327739b1702bd HTTP/1.1,200,16
127.0.0.1,-,-,[08/May/2014:14:27:35 +0800],GET /jygl/jaxrs/exam/examSubscribeService/getUtilObjectThirExamBatchsByEBNN/201403 HTTP/1.1,200,653
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:35 +0800],GET /jyglFront/exam_initgroupsubscribestatistic HTTP/1.1,200,13551
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:37 +0800],GET /jyglFront/exam_initsubstudentsubscribe HTTP/1.1,500,3900
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:41 +0800],GET /jyglFront/supervisor/intoInitAssignmentDetail HTTP/1.1,200,1808
127.0.0.1,-,-,[08/May/2014:14:27:42 +0800],GET /jygl/jaxrs/right/isValidUserByType/1-admin-superadmin HTTP/1.1,200,20
127.0.0.1,-,-,[08/May/2014:14:27:42 +0800],GET /jygl/jaxrs/right/getUserByLoginName/admin HTTP/1.1,200,198
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/right_login2home?loginName=admin&password=superadmin&type=1 HTTP/1.1,200,2525
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/js/tree.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/style/style.css HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/js/frame.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/js/jquery.js HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/menuList.jsp HTTP/1.1,200,47603
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/images/leftmenu_bg.gif HTTP/1.1,404,1105
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/images/allmenu.gif HTTP/1.1,404,1093
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/images/logo.png HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/style/images/header_bg.jpg HTTP/1.1,304,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/mainView/navigate/images/toggle_menu.gif HTTP/1.1,404,1105
127.0.0.1,-,-,[08/May/2014:14:27:42 +0800],GET /jygl/jaxrs/article/getArticleList/10-1 HTTP/1.1,200,20913
127.0.0.1,-,-,[08/May/2014:14:27:42 +0800],GET /jygl/jaxrs/article/getTotalArticleRecords HTTP/1.1,200,22
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:42 +0800],GET /jyglFront/baseInfo_articleList?flag=1 HTTP/1.1,200,8989
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:43 +0800],GET /jyglFront/mainView/studentView/style/images/nav_10.png HTTP/1.1,404,1117
127.0.0.1,-,-,[08/May/2014:14:27:44 +0800],GET /jygl/jaxrs/nationInfo/getAllNationInPage?pageSize=10&pageNo=1 HTTP/1.1,200,374
127.0.0.1,-,-,[08/May/2014:14:27:44 +0800],GET /jygl/jaxrs/nationInfo/getTotalNations HTTP/1.1,200,22
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:44 +0800],GET /jyglFront/baseInfo_nationInfoList HTTP/1.1,200,7471
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:44 +0800],GET /jyglFront/common/css/menuStyle2.css HTTP/1.1,404,1060
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:44 +0800],GET /jyglFront/common/css/basic.css HTTP/1.1,200,1476
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:45 +0800],GET /jyglFront/common/css/_images/botton2.gif HTTP/1.1,404,1075
127.0.0.1,-,-,[08/May/2014:14:27:47 +0800],GET /jygl/jaxrs/enroll/educationLevelService/allEducationLevels HTTP/1.1,200,227
127.0.0.1,-,-,[08/May/2014:14:27:47 +0800],GET /jygl/jaxrs/enroll/gradeInfoService/allGradeInfos HTTP/1.1,200,3785
127.0.0.1,-,-,[08/May/2014:14:27:47 +0800],GET /jygl/jaxrs/teaching/teachingPlanService/getSpeicalListByTwo?gradeID=&educationLevelID= HTTP/1.1,200,12061
127.0.0.1,-,-,[08/May/2014:14:27:47 +0800],GET /jygl/jaxrs/enroll/studyCenterService/allStudyCentersByUtilObject HTTP/1.1,200,6006
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:48 +0800],GET /jyglFront/teaching/openReplaceChooseCourse HTTP/1.1,200,26455
127.0.0.1,-,-,[08/May/2014:14:27:49 +0800],GET /jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurClassBatchPlanVOList?newClassBatchName=&gradeName=&term=-1 HTTP/1.1,204,-
127.0.0.1,-,-,[08/May/2014:14:27:49 +0800],GET /jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurClassBatchPlanVOList?newClassBatchName=&gradeName=&term=-1 HTTP/1.1,204,-
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:49 +0800],GET /jyglFront/teaching/openChooseCourse HTTP/1.1,200,1611
127.0.0.1,-,-,[08/May/2014:14:27:51 +0800],GET /jygl/jaxrs/enroll/gradeInfoService/currentGradeInfo HTTP/1.1,200,473
127.0.0.1,-,-,[08/May/2014:14:27:51 +0800],GET /jygl/jaxrs/enroll/educationLevelService/allEducationLevels HTTP/1.1,200,227
127.0.0.1,-,-,[08/May/2014:14:27:51 +0800],GET /jygl/jaxrs/enroll/gradeInfoService/allGradeInfos HTTP/1.1,200,3785
127.0.0.1,-,-,[08/May/2014:14:27:51 +0800],GET /jygl/jaxrs/teaching/teachingPlanService/hasTeachingPlanInGrade?gradeId=4af2a042437c2c0801437ed1cdea0017 HTTP/1.1,200,20
127.0.0.1,-,-,[08/May/2014:14:27:51 +0800],GET /jygl/jaxrs/teaching/teachingPlanService/hasTeachingPlanInGrade?gradeId=4af2a0423f41d66d013f5a1f766c00ce HTTP/1.1,200,20
127.0.0.1,-,-,[08/May/2014:14:27:51 +0800],GET /jygl/jaxrs/teaching/teachingPlanService/teachingPlanListByEducationLevelAndGradeId?grade=4af2a042437c2c0801437ed1cdea0017&educationLevel= HTTP/1.1,200,4849
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:52 +0800],GET /jyglFront/teaching/teachingPlanList HTTP/1.1,200,22794
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:27:52 +0800],GET /jyglFront/js/jquery.form.js HTTP/1.1,200,30330
127.0.0.1,-,-,[08/May/2014:14:28:02 +0800],GET /jygl/jaxrs/exam/examArrangeService/getExamBatchIdByLatest HTTP/1.1,200,43
127.0.0.1,-,-,[08/May/2014:14:28:02 +0800],GET /jygl/jaxrs/exam/examArrangeService/getExamBatchNameByEBId/4af2a0424323412e014327739b1702bd HTTP/1.1,200,16
127.0.0.1,-,-,[08/May/2014:14:28:02 +0800],GET /jygl/jaxrs/exam/examSubscribeService/getUtilObjectThirExamBatchsByEBNN/201403 HTTP/1.1,200,653
0:0:0:0:0:0:0:1,-,-,[08/May/2014:14:28:02 +0800],GET /jyglFront/exam_initgroupsubscribestatistic HTTP/1.1,200,13551
127.0.0.1,-,-,[08/May/2014:14:28:19 +0800],POST /jygl/jaxrs/right/addUserLog HTTP/1.1,200,-
127.0.0.1,-,-,[08/May/2014:14:31:42 +0800],GET /jygl/jaxrs/exam/examSubscribeService/groupSubscribe/201403/0/0/201309/1 HTTP/1.1,200,-

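2. Problem to solve: count how many times each resource (URL) is accessed.

3. Implementation

Idea: parse the Tomcat log. The map step extracts the URL from each log line and emits it as the key, with the value 1 standing for one access. The reduce step merges the records that share the same key and emits the total number of accesses as the value.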
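Before moving to the Hadoop job, the parsing idea can be tried on a single machine. The sketch below is plain Java with hypothetical names (UrlExtractor, extractUrl, countUrls). It assumes the comma-separated layout shown in the sample log, where the fifth field has the form "METHOD URL PROTOCOL", and it would mis-parse a line whose URL itself contains a comma.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java helper mirroring the parsing idea; names are illustrative.
class UrlExtractor {

    // Returns the URL from one comma-separated log line, or null when the line
    // does not contain a "METHOD URL PROTOCOL" request in its fifth field.
    static String extractUrl(String line) {
        String[] fields = line.split(",");
        if (fields.length < 5) {
            return null;
        }
        String request = fields[4];            // e.g. "GET / HTTP/1.1"
        int first = request.indexOf(' ');
        int last = request.lastIndexOf(' ');
        if (first < 0 || first == last) {
            return null;                       // fewer than two spaces
        }
        return request.substring(first + 1, last);
    }

    // Single-machine equivalent of the job: URL -> number of requests.
    static Map<String, Integer> countUrls(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String url = extractUrl(line);
            if (url != null) {
                counts.merge(url, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "127.0.0.1,-,-,[08/May/2014:13:42:40 +0800],GET / HTTP/1.1,200,11444",
                "127.0.0.1,-,-,[08/May/2014:14:27:10 +0800],GET / HTTP/1.1,200,11444",
                "127.0.0.1,-,-,[08/May/2014:14:28:19 +0800],POST /jygl/jaxrs/right/addUserLog HTTP/1.1,200,-");
        System.out.println(countUrls(sample));
    }
}
```

In the MapReduce version, extractUrl corresponds to the work done in map, and the merge inside countUrls corresponds to what the framework's grouping plus reduce achieve across machines.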

The code is as follows:

package org.ly.ccnu;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondTest extends Configured implements Tool {

    // Counts the input lines that could not be parsed.
    enum Counter {
        LINESKIP
    }

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            try {
                // Field 4 is the request, e.g. "GET /path HTTP/1.1".
                String[] lineSplit = line.split(",");
                String requestUrl = lineSplit[4];
                // Keep only the URL between the first and the last space.
                requestUrl = requestUrl.substring(
                        requestUrl.indexOf(' ') + 1, requestUrl.lastIndexOf(' '));
                context.write(new Text(requestUrl), one);
            } catch (IndexOutOfBoundsException e) {
                // Malformed line (too few fields or no spaces): skip and count it.
                context.getCounter(Counter.LINESKIP).increment(1);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the values instead of counting them, so the result stays
            // correct if the values are ever pre-aggregated by a combiner.
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "logAnalysis");
        job.setJarByClass(SecondTest.class);
        // args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Output key/value types; they match what Map and Reduce emit.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SecondTest(), args);
        System.exit(res);
    }
}
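One detail is worth illustrating, since the concept section mentions combine as a local optimization. If map outputs are pre-aggregated before they reach reduce, the reducer receives partial sums rather than a stream of 1s, so it has to add the values up instead of counting them. The sketch below is plain Java with hypothetical names (CombinerDemo and its methods) and only simulates that situation; it is not Hadoop code and does not configure a combiner on the job above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a reducer should sum its values; not Hadoop code.
class CombinerDemo {

    // Reducer variant that counts how many values it received.
    static int reduceByCounting(List<Integer> values) {
        int count = 0;
        for (int ignored : values) {
            count = count + 1;
        }
        return count;
    }

    // Reducer variant that adds the values up.
    static int reduceBySumming(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Local combine: each map task collapses its own values into one partial sum.
    static List<Integer> combine(List<List<Integer>> perMapTaskValues) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> values : perMapTaskValues) {
            partials.add(reduceBySumming(values));
        }
        return partials;
    }

    public static void main(String[] args) {
        // One URL seen 3 times by map task A and 2 times by map task B.
        List<List<Integer>> perTask = Arrays.asList(
                Arrays.asList(1, 1, 1), Arrays.asList(1, 1));
        List<Integer> partials = combine(perTask);
        System.out.println("counting: " + reduceByCounting(partials));
        System.out.println("summing:  " + reduceBySumming(partials));
    }
}
```

With five accesses split across two map tasks, counting the partial results gives 2 while summing them gives the correct total of 5. Without any combine step both variants agree, because every value is 1.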

4. Results

/ 2
/jygl/jaxrs/article/getArticleList/10-1 3
/jygl/jaxrs/article/getTotalArticleRecords 3
/jygl/jaxrs/enroll/educationLevelService/allEducationLevels 5
/jygl/jaxrs/enroll/gradeInfoService/allGradeInfos 2
/jygl/jaxrs/enroll/gradeInfoService/currentGradeInfo 1
/jygl/jaxrs/enroll/studyCenterService/allStudyCentersByUtilObject 1
/jygl/jaxrs/exam/examArrangeService/getExamBatchIdByLatest 2
/jygl/jaxrs/exam/examArrangeService/getExamBatchNameByEBId/4af2a0424323412e014327739b1702bd 2
/jygl/jaxrs/exam/examParameterService/getAllGradeInfo 3
/jygl/jaxrs/exam/examParameterService/getAllStudyCenters 3
/jygl/jaxrs/exam/examSubscribeService/getUtilObjectThirExamBatchsByEBNN/201403 2
/jygl/jaxrs/exam/examSubscribeService/groupSubscribe/201403/0/0/201309/1 1
/jygl/jaxrs/graduate/graduateBatchService/getAllGraduateBatch 1
/jygl/jaxrs/graduate/graduateBatchService/getGraduateBatchByConditions?graduateBatchName=&pageSize=10&pageNo=1 1
/jygl/jaxrs/graduate/graduateBatchService/getTotalGraduateBatchByCondition?graduateBatchName= 1
/jygl/jaxrs/nationInfo/getAllNationInPage?pageSize=10&pageNo=1 1
/jygl/jaxrs/nationInfo/getTotalNations 1
/jygl/jaxrs/right/addUserLog 1
/jygl/jaxrs/right/getUserByLoginName/admin 3
/jygl/jaxrs/right/isValidUserByType/1-admin-superadmin 3
/jygl/jaxrs/teaching/teachingPlanService/getSpeicalListByTwo?gradeID=&educationLevelID= 1
/jygl/jaxrs/teaching/teachingPlanService/hasTeachingPlanInGrade?gradeId=4af2a0423f41d66d013f5a1f766c00ce 1
/jygl/jaxrs/teaching/teachingPlanService/hasTeachingPlanInGrade?gradeId=4af2a042437c2c0801437ed1cdea0017 1
/jygl/jaxrs/teaching/teachingPlanService/teachingPlanListByEducationLevelAndGradeId?grade=4af2a042437c2c0801437ed1cdea0017&educationLevel= 1
/jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurClassBatchPlanVOList?newClassBatchName=&gradeName=&term=-1 2
/jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurClassPlanVO 2
/jygl/jaxrs/teachingManage/ClassBatchPlanService/getCurrentClassPlanVO 2
/jyglFront/baseInfo_articleList?flag=1 3
/jyglFront/baseInfo_nationInfoList 1
/jyglFront/common/css/_images/botton2.gif 1
/jyglFront/common/css/basic.css 1
/jyglFront/common/css/menuStyle2.css 1
/jyglFront/exam_initgroupsubscribestatistic 2
/jyglFront/exam_initsubstudentsubscribe 1
/jyglFront/graduate_initGraduateBatch 1
/jyglFront/graduate_initGraduateInfo 1
/jyglFront/graduate_initGraduateQulifyCheck 1
/jyglFront/graduate_initLeaveSchoolInfo 1
/jyglFront/js/jquery.form.js 1
/jyglFront/mainView/navigate/images/allmenu.gif 3
/jyglFront/mainView/navigate/images/leftmenu_bg.gif 3
/jyglFront/mainView/navigate/images/logo.png 3
/jyglFront/mainView/navigate/images/toggle_menu.gif 3
/jyglFront/mainView/navigate/js/frame.js 3
/jyglFront/mainView/navigate/js/jquery.js 3
/jyglFront/mainView/navigate/js/tree.js 3
/jyglFront/mainView/navigate/menuList.jsp 3
/jyglFront/mainView/navigate/style/images/header_bg.jpg 3
/jyglFront/mainView/navigate/style/style.css 3
/jyglFront/mainView/studentView/style/images/nav_10.png 3
/jyglFront/right_login2home?loginName=admin&password=superadmin&type=1 3
/jyglFront/supervisor/intoInitAssignmentDetail 1
/jyglFront/teaching/openChooseCourse 1
/jyglFront/teaching/openReplaceChooseCourse 1
/jyglFront/teaching/teachingPlanList 1

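Copyright notice: original article of this site, published by 星锅 on 2022-01-20.
Reprint notice: unless otherwise stated, articles on this site are released under the CC-4.0 license. Please credit the source when reprinting.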