Apache Hive fails to collect stats


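Environment:
hive: apache-hive-1.1.0
Hadoop: hadoop-2.5.0-cdh5.3.2
Hive metadata and stats are both stored in MySQL.
The Hive stats parameters are as follows:
hive.stats.autogather: automatically gathers statistics during insert overwrite commands; default is true; set to true
hive.stats.dbclass: the database that stores Hive's temporary statistics; default is jdbc:derby; set to jdbc:mysql
hive.stats.jdbcdriver: the JDBC driver for the database that temporarily stores Hive statistics; set to com.mysql.jdbc.Driver
hive.stats.dbconnectionstring: connection string for the temporary statistics database; default is jdbc:derby:databaseName=TempStatsStore;create=true; set to jdbc:mysql://[ip:port]/[dbname]?user=[username]&password=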


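hive.stats.default.publisher: the publisher class used by default when dbclass is neither jdbc nor hbase; it must implement the StatsPublisher interface; default is empty; left at the default
hive.stats.default.aggregator: the aggregator class used when dbclass is neither jdbc nor hbase; it must implement the StatsAggregator interface; default is empty; left at the default
hive.stats.jdbc.timeout: JDBC connection timeout; default is 30 seconds; left at the default
hive.stats.retries.max: maximum number of retries when the stats publisher or aggregator hits an exception while updating the database; default is 0 (no retries); left at the default
hive.stats.retries.wait: base wait window between retries; default is 3000 milliseconds; left at the default
hive.client.stats.publishers: comma-separated list of stats publisher classes invoked for the counters of each job; default is empty; each class must implement the org.apache.hadoop.hive.ql.stats.ClientStatsPublisher interface; left at the default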
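As a quick illustration of the non-default settings listed above, the following sketch collects them in a map and renders them as --hiveconf arguments (the same flag used later in this article for the logger). The class and method names are hypothetical. The connection string is left out on purpose because its password part is not shown in this article. Note that the driver class name is case-sensitive.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class StatsOverrides {
    // The settings this article changes from their defaults (hypothetical helper).
    static Map<String, String> overrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.stats.autogather", "true");
        conf.put("hive.stats.dbclass", "jdbc:mysql");
        conf.put("hive.stats.jdbcdriver", "com.mysql.jdbc.Driver");
        // hive.stats.dbconnectionstring is deliberately omitted:
        // the article does not show the full value.
        return conf;
    }

    // Renders each entry as "--hiveconf key=value", separated by spaces.
    static String toHiveconfArgs(Map<String, String> conf) {
        StringJoiner args = new StringJoiner(" ");
        for (Map.Entry<String, String> e : conf.entrySet()) {
            args.add("--hiveconf " + e.getKey() + "=" + e.getValue());
        }
        return args.toString();
    }

    public static void main(String[] args) {
        System.out.println("hive " + toHiveconfArgs(overrides()));
    }
}
```

In a real deployment these values would more likely live in hive-site.xml; passing them on the command line is only convenient for a one-off test.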
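Symptom:

Running insert overwrite table does not return correct values for numRows and rawDataSize; the result looks like this:
[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]
No related stats rows are inserted into the Hive stats MySQL database either.
The problem was first narrowed down to Hive stats, but the console printed too little information to pinpoint it, so Hive was started with
hive --hiveconf hive.root.logger=INFO,console to print detailed logs, which revealed the following: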
[Error 30001]: StatsPublisher cannot be initialized. There was a error in the initialization
of StatsPublisher, and retrying might help. If you dont want the query to fail because accurate
statistics could not be collected, set hive.stats.reliable=false
Specified key was too long; max key length is 767 bytes
The cause is straightforward: in Hive 1.1.0 the ID column length defaults to 4000 and the ID column is the primary key, so MySQL rejects the key as too long.
org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsSetupConstants
    // MySQL - 65535, SQL Server - 8000, Oracle - 4000, Derby - 32762, Postgres - large.
    public static final int ID_COLUMN_VARCHAR_SIZE = 4000;

org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher: public boolean init(Configuration hconf)
    if (colSize < JDBCStatsSetupConstants.ID_COLUMN_VARCHAR_SIZE) {
        String alterTable = JDBCStatsUtils.getAlterIdColumn();
        stmt.executeUpdate(alterTable);
    }

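This code shows that if the table's ID column size is smaller than 4000, it is automatically altered back to 4000, so shrinking the column by hand in MySQL does not help. The only option is to change the constant in the source from 4000 to 255. MySQL here uses utf8 encoding, where one character takes up to 3 bytes, so 255*3=765<767. For the current cluster, 255 characters is long enough.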

public static final int ID_COLUMN_VARCHAR_SIZE = 255;
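
The arithmetic behind the choice of 255 can be checked with a small sketch. It assumes the 767-byte limit quoted in the error message and 3 bytes per character for MySQL's utf8 charset, as stated above; the class and method names are hypothetical and not part of Hive.

```java
public class IndexKeyLimit {
    // Limit quoted in the MySQL error message in the log above.
    static final int MAX_KEY_BYTES = 767;
    // Assumption from the article: one utf8 character takes up to 3 bytes.
    static final int UTF8_MAX_BYTES_PER_CHAR = 3;

    // Largest VARCHAR length (in characters) whose worst-case size fits the key limit.
    static int maxIndexableChars(int maxKeyBytes, int bytesPerChar) {
        return maxKeyBytes / bytesPerChar;
    }

    // True if a VARCHAR(chars) key fits within the limit in the worst case.
    static boolean fits(int chars, int maxKeyBytes, int bytesPerChar) {
        return (long) chars * bytesPerChar <= maxKeyBytes;
    }

    public static void main(String[] args) {
        // Prints 255
        System.out.println(maxIndexableChars(MAX_KEY_BYTES, UTF8_MAX_BYTES_PER_CHAR));
        // Prints false: 4000 * 3 = 12000 bytes exceeds 767
        System.out.println(fits(4000, MAX_KEY_BYTES, UTF8_MAX_BYTES_PER_CHAR));
        // Prints true: 255 * 3 = 765 bytes
        System.out.println(fits(255, MAX_KEY_BYTES, UTF8_MAX_BYTES_PER_CHAR));
    }
}
```

With a charset that needs more bytes per character, the same calculation gives a smaller maximum, so 255 is tied to the utf8 assumption.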

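After recompiling, packaging, and pushing the build to the test environment, testing showed the problem was still there:
[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]
Running hive --hiveconf hive.root.logger=INFO,console again to print detailed logs showed no exception this time.

To trace the problem further, set hive.stats.reliable=true;
Re-running the command now fails, and the job's error information shows that the problem occurs in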
org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsAggregator
    try {
        Class.forName(driver).newInstance();
    } catch (Exception e) {
        LOG.error("Error during instantiating JDBC driver " + driver + ". ", e);
        return false;
    }

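This code runs on YARN, where the com.mysql.jdbc.Driver class cannot be found. Placing the MySQL driver jar in the yarn/lib/ directory, pushing it to every node in the cluster, and re-running the test script resolved the problem.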
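
A missing driver like this can be confirmed with a small check that mirrors the Class.forName call above. This is only a sketch: the class name DriverCheck and its method are hypothetical, and the result reflects only the classpath of the JVM that runs it, so it has to be run with the same classpath the YARN tasks see.

```java
public class DriverCheck {
    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but failed to link or initialize.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the MySQL driver class named in this article.
        String driver = args.length > 0 ? args[0] : "com.mysql.jdbc.Driver";
        System.out.println(driver + " on classpath: " + isOnClasspath(driver));
    }
}
```

Unlike the Hive code, the sketch does not call newInstance(); loading the class is enough to tell whether the jar is visible.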


Permanent link to this article: http://www.linuxidc.com/Linux/2015-05/117294.htm
