
Testing Mahout 0.9 Patched with 1329-3.patch on Hadoop 2.2.0 Using Bayesian Text Classification


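Introduction
This follows the previous article, "Patching Mahout 0.9 to Support Hadoop 2.2.0": http://www.linuxidc.com/Linux/2014-09/106286.htm
After patching Mahout 0.9 and compiling it successfully, we use Bayesian text classification to test its compatibility with Hadoop 2.2.0.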
 
Step 1: Upload the 20news files to HDFS
yarn@singletest:~/Mahout/mahout-distribution-0.7$ hadoop fs -ls /workspace/mahout/week4/data/20news
Found 2 items
drwxr-xr-x   - yarn supergroup          0 2014-09-04 21:52 /workspace/mahout/week4/data/20news/20news-bydate-test
drwxr-xr-x   - yarn supergroup          0 2014-09-04 21:57 /workspace/mahout/week4/data/20news/20news-bydate-train
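
The listing above shows the uploaded data but not the upload itself. A minimal sketch of that step follows, assuming the data set was unpacked locally into a hypothetical directory ~/20news. The local path and the dry-run helper `run` are illustrative and do not come from the original session.

```shell
#!/bin/sh
# Hypothetical local directory holding the unpacked 20news-bydate data (assumption).
LOCAL_DIR="${LOCAL_DIR:-$HOME/20news}"
# HDFS target directory, taken from the listing above.
HDFS_DIR="/workspace/mahout/week4/data/20news"

# Print each command instead of executing it unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

upload_20news() {
    # Create the target directory, then copy both local directories into it.
    run hadoop fs -mkdir -p "$HDFS_DIR" &&
    run hadoop fs -put "$LOCAL_DIR/20news-bydate-train" "$LOCAL_DIR/20news-bydate-test" "$HDFS_DIR"
}

upload_20news
```

By default the script only prints the two hadoop commands, so they can be reviewed first; setting RUN=1 executes them.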

Step 2: Create sequence files from the data
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seqdirectory -i /workspace/mahout/week4/data/20news -o /workspace/mahout/week4/data/20news_seq
 
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_seq
Found 1 items
-rw-r--r--   1 yarn supergroup   37064977 2014-09-04 22:12 /workspace/mahout/week4/data/20news_seq/chunk-0

Step 3: Convert the sequence files to vectors
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seq2sparse -i /workspace/mahout/week4/data/20news_seq/ -o /workspace/mahout/week4/data/20news_vectors -lnorm -nv -wt tfidf
 
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_vectors
Found 7 items
drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/df-count
-rw-r--r--   1 yarn supergroup    1937084 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/dictionary.file-0
-rw-r--r--   1 yarn supergroup    1890053 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/frequency.file-0
drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:19 /workspace/mahout/week4/data/20news_vectors/tf-vectors
drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:21 /workspace/mahout/week4/data/20news_vectors/tfidf-vectors
drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/tokenized-documents
drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/wordcount

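Step 4: Split the vectors into a training set and a test set
Parameters:
  • -tr: output path for the training set
  • -te: output path for the test set
  • -rp: percentage of the whole data set to use as test data; the command below sets it to 20%.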
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout split -i /workspace/mahout/week4/data/20news_vectors/tfidf-vectors -tr /workspace/mahout/week4/data/train-vectors -te /workspace/mahout/week4/data/test-vectors -rp 20 -ow -seq -xm sequential

Step 5: Train the model
yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout trainnb -i /workspace/mahout/week4/data/train-vectors -el -o /workspace/mahout/week4/nbmodel -li /workspace/mahout/week4/labindex -ow -c
 
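View the generated label index. Note that it contains only two labels, the two top-level directory names, rather than the 20 newsgroup categories: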
yarn@singletest:~$ hadoop fs -text /workspace/mahout/week4/labindex
20news-bydate-test      0
20news-bydate-train     1
 
View the trained model:
yarn@singletest:~$ hadoop fs -ls /workspace/mahout/week4/nbmodel
Found 1 items
-rw-r--r--   1 yarn supergroup    2437874 2014-09-05 23:09 /workspace/mahout/week4/nbmodel/naiveBayesModel.bin

Step 6: Test
yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout testnb -i /workspace/mahout/week4/data/test-vectors -m /workspace/mahout/week4/nbmodel -l /workspace/mahout/week4/labindex -ow -o /workspace/mahout/week4/20news-test-result -c
 
Note: the input path given after -i for testing is the test set produced by the split in Step 4.
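
The Mahout commands from Steps 2 to 6 can be collected into one script. The sketch below reuses the exact paths and flags shown above; the `BASE` and `MAHOUT` variables and the dry-run helper `run` are illustrative additions, and the default launcher path `./mahout` is an assumption about the working directory.

```shell
#!/bin/sh
# Base HDFS directory used throughout the article.
BASE="/workspace/mahout/week4"
# Path to the mahout launcher; the default is an assumption, adjust to your install.
MAHOUT="${MAHOUT:-./mahout}"

# Print each command instead of executing it unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

pipeline() {
    # Each step runs only if the previous one succeeded.
    run "$MAHOUT" seqdirectory -i "$BASE/data/20news" -o "$BASE/data/20news_seq" &&
    run "$MAHOUT" seq2sparse -i "$BASE/data/20news_seq/" -o "$BASE/data/20news_vectors" -lnorm -nv -wt tfidf &&
    run "$MAHOUT" split -i "$BASE/data/20news_vectors/tfidf-vectors" -tr "$BASE/data/train-vectors" -te "$BASE/data/test-vectors" -rp 20 -ow -seq -xm sequential &&
    run "$MAHOUT" trainnb -i "$BASE/data/train-vectors" -el -o "$BASE/nbmodel" -li "$BASE/labindex" -ow -c &&
    run "$MAHOUT" testnb -i "$BASE/data/test-vectors" -m "$BASE/nbmodel" -l "$BASE/labindex" -ow -o "$BASE/20news-test-result" -c
}

pipeline
```

Without RUN=1 the script prints the five commands in order, which makes it easy to check that testnb reads the test-vectors path written by split.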
 
Test results:
14/09/05 23:18:09 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       2887       74.9675%
Incorrectly Classified Instances        :        964       25.0325%
Total Classified Instances              :       3851
 
=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
1131    413      |  1544        a     = 20news-bydate-test
551     1756     |  2307        b     = 20news-bydate-train
 
=======================================================
Statistics
-------------------------------------------------------
Kappa                                        0.486
Accuracy                                   74.9675%
Reliability                                49.7892%
Reliability (standard deviation)            0.4314
 
14/09/05 23:18:09 INFO driver.MahoutDriver: Program took 17504 ms (Minutes: 0.29173333333333334)
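
The Summary and Statistics figures can be rechecked from the confusion matrix alone. The sketch below assumes that rows are actual classes and columns are predicted classes, and that the reported Kappa is Cohen's kappa; the function name `confusion_stats` is made up for illustration.

```shell
#!/bin/sh
# Recompute accuracy and Cohen's kappa from a 2x2 confusion matrix.
# Arguments: aa ab ba bb (row a: aa ab, row b: ba bb).
confusion_stats() {
    awk -v aa="$1" -v ab="$2" -v ba="$3" -v bb="$4" 'BEGIN {
        total   = aa + ab + ba + bb
        correct = aa + bb                      # diagonal of the matrix
        po = correct / total                   # observed agreement (accuracy)
        # Chance agreement: sum over classes of (row total * column total) / total^2.
        pe = ((aa + ab) * (aa + ba) + (ba + bb) * (ab + bb)) / (total * total)
        kappa = (po - pe) / (1 - pe)
        printf "total=%d correct=%d accuracy=%.4f%% kappa=%.3f\n", total, correct, po * 100, kappa
    }'
}

# Values copied from the confusion matrix above.
confusion_stats 1131 413 551 1756
```

With the values above this prints total=3851 correct=2887 accuracy=74.9675% kappa=0.486, which agrees with the reported figures.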


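Copyright notice: original article from this site, published by 星锅 on 2022-01-20; 4530 characters in total.
Reprint notice: unless otherwise stated, articles on this site are released under the CC-4.0 license; please credit the source when reprinting.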