
Rebuilding and Installing the Spark Assembly to Enable Spark SQL on CDH 5.5.1


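The Spark version bundled with CDH does not support spark-sql or SparkR. To use them, the Hive dependencies have to be packaged into the Spark assembly jar. The steps below cover building and installing the assembly for spark-sql.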

I. Prepare the build environment on any Linux machine

spark-1.5.0.tgz, download page: https://spark.apache.org/downloads.html

jdk1.7.0_79

scala2.10.4

maven3.3.9

These versions follow the requirements stated on the Spark website, quoted below. For details see: https://spark.apache.org/docs/

Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.5.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).

Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. The Spark build can supply a suitable Maven binary;

Set the environment variables as follows (in /etc/profile) and apply them with: source /etc/profile

export JAVA_HOME=/data/jdk1.7.0_79
export M2_HOME=/data/apache-maven-3.3.9
export SCALA_HOME=/data/scala-2.10.4
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$SCALA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
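
Before starting a long build, it can help to confirm that the three tools actually resolve on PATH after sourcing the profile. The following is only a minimal sketch; the helper name check_tool is hypothetical and not part of Java, Maven or Scala.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: report whether a command resolves on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: NOT found on PATH"
        return 1
    fi
}

# The three tools this build relies on.
for tool in java mvn scala; do
    check_tool "$tool" || true
done
```

If all three are found, java -version, mvn -version and scala -version show whether the installed versions match the ones listed above.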

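II. Build steps

For more details see the official documentation: https://spark.apache.org/docs/1.5.0/building-spark.html

1. Raise the memory limits for the Maven build, since compilation is complex and takes a long time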

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

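2. Extract spark-1.5.0.tgz (for example under /data), then run mvn with nohup from the extracted directory so that the build runs in the background and writes its output to a log file

nohup mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.1 -Dscala-2.10.4 -Phive -Phive-thriftserver -DskipTests clean package > ./spark-mvn-`date +%Y%m%d%H`.log 2>&1 &

The first build takes 2 to 3 hours depending on network conditions (mine succeeded only after several attempts). On success, the log ends as follows: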

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [3.200 s]
[INFO] Spark Project Launcher ............................. SUCCESS [8.887 s]
[INFO] Spark Project Networking ........................... SUCCESS [8.270 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [4.832 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [6.082 s]
[INFO] Spark Project Core ................................. SUCCESS [01:52 min]
[INFO] Spark Project Bagel ................................ SUCCESS [5.129 s]
[INFO] Spark Project GraphX ............................... SUCCESS [13.442 s]
[INFO] Spark Project Streaming ............................ SUCCESS [30.683 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [43.622 s]
[INFO] Spark Project SQL .................................. SUCCESS [53.463 s]
[INFO] Spark Project ML Library ........................... SUCCESS [01:06 min]
[INFO] Spark Project Tools ................................ SUCCESS [2.225 s]
[INFO] Spark Project Hive ................................. SUCCESS [42.020 s]
[INFO] Spark Project REPL ................................. SUCCESS [8.500 s]
[INFO] Spark Project YARN ................................. SUCCESS [9.665 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [7.255 s]
[INFO] Spark Project Assembly ............................. SUCCESS [02:15 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [7.330 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [5.103 s]
[INFO] Spark Project External Flume ....................... SUCCESS [8.405 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [2.928 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [15.932 s]
[INFO] Spark Project External MQTT Assembly ............... SUCCESS [7.792 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [6.057 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [10.135 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:49 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [8.111 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [5.814 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:28 min
[INFO] Finished at: 2016-07-26T16:05:11+08:00
[INFO] Final Memory: 90M/1589M
[INFO] ------------------------------------------------------------------------

The generated Spark assembly jar can then be found at:

/data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
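
Since the whole point of the rebuild is to get the Hive classes into the assembly, it may be worth checking the result before copying it anywhere. The sketch below assumes unzip is installed and uses two hypothetical helpers, build_succeeded and has_spark_sql_cli; the class it looks for, SparkSQLCLIDriver, is the one the bin/spark-sql script launches.

```shell
#!/bin/sh
# Path follows the article's layout; adjust it to your own build directory.
JAR=/data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar

# build_succeeded is a hypothetical helper: true if a Maven log reports success.
build_succeeded() {
    grep -q 'BUILD SUCCESS' "$1"
}

# has_spark_sql_cli is a hypothetical helper: reads a jar listing on stdin and
# looks for the class that the spark-sql script launches.
has_spark_sql_cli() {
    grep -q 'org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver'
}

# Only inspect the jar when it exists and unzip is available.
if [ -f "$JAR" ] && command -v unzip >/dev/null 2>&1; then
    if unzip -l "$JAR" | has_spark_sql_cli; then
        echo "assembly contains the Spark SQL CLI classes"
    else
        echo "assembly is missing the Spark SQL CLI classes"
    fi
fi
```

build_succeeded can be pointed at the log file written by the nohup command to check the build result without scrolling to the end of the log.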

III. Install the Spark assembly

1. Copy the assembly jar

Copy the jar to the CDH machine 180.153.*.*, for example into /home/hadoop

scp -P 50201 /data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar root@180.153.*.*:/home/hadoop

Then copy it into the CDH jars directory. If a jar of the same name already exists there, back it up and remove it first.

cp -p /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
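
The back-up-then-replace step can be wrapped in a small function so the original jar is never lost. This is a sketch under assumptions: backup_then_copy is a hypothetical helper, and the timestamped .bak suffix is just one possible naming scheme.

```shell
#!/bin/sh
# backup_then_copy is a hypothetical helper: if the destination directory
# already holds a file with the same name, move it aside with a timestamp
# suffix, then copy the new file in, preserving mode and timestamps.
backup_then_copy() {
    src=$1
    dest_dir=$2
    target="$dest_dir/$(basename "$src")"
    if [ -e "$target" ] || [ -L "$target" ]; then
        mv "$target" "$target.bak.$(date +%Y%m%d%H%M%S)"
    fi
    cp -p "$src" "$dest_dir/"
}

# Example call, using the paths from the step above:
# backup_then_copy /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar \
#     /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
```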

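2. Replace the assembly jar under Spark in CDH

This just means re-pointing the symlink spark-assembly.jar at spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar in the CDH jars directory. The symlinks live in
/opt/cloudera/parcels/CDH/lib/spark/lib. Delete the old links and create the new ones: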

ln -s ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar  
ln -s  spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar  spark-assembly.jar

Check the symlinks

[root@db1 lib]# ll
total 209204
-rw-r--r-- 1 root root     21645 Dec  3  2015 python.tar.gz
lrwxrwxrwx 1 root root        68 Jan 14  2016 spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root        54 Jan 14  2016 spark-assembly.jar -> spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root        68 Jan 14  2016 spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root        54 Jan 14  2016 spark-examples.jar -> spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
[root@db1 lib]# 
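
As an alternative to deleting and recreating the links by hand, ln -sfn replaces an existing symlink in one step, and readlink confirms where it now points. The helper name repoint_link below is hypothetical; treat this as a sketch rather than the procedure the article followed.

```shell
#!/bin/sh
# repoint_link is a hypothetical helper: make LINK point at TARGET, replacing
# any existing symlink of that name, then print where the link now points.
repoint_link() {
    target=$1
    link=$2
    ln -sfn "$target" "$link"
    readlink "$link"
}

# Example calls, run from /opt/cloudera/parcels/CDH/lib/spark/lib:
# repoint_link ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar \
#     spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
# repoint_link spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly.jar
```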

3. Copy the spark-sql launcher script

Copy it from the bin directory of the Spark source tree into the bin directory of Spark in CDH

scp -P 50201 /data/spark-1.5.0/bin/spark-sql root@180.153.*.*:/opt/cloudera/parcels/CDH/lib/spark/bin

4. Configure the environment variables

export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CMD=/opt/cloudera/parcels/CDH/bin/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin

5. Copy the assembly jar to HDFS

First copy the assembly jar into the HDFS directory /user/spark/share/lib, then set its permissions to 755

hadoop fs -put /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar  /user/spark/share/lib
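
The step above mentions setting the permissions to 755 but only shows the upload. A possible full sequence is sketched below using hadoop fs -mkdir, -put and -chmod. The run and upload_assembly helpers and the DRY_RUN switch are hypothetical conveniences; by default the script only prints the commands, so nothing touches HDFS until DRY_RUN=0 is set.

```shell
#!/bin/sh
JAR=/home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
HDFS_DIR=/user/spark/share/lib

# Default to a dry run; set DRY_RUN=0 in the environment to execute for real.
DRY_RUN=${DRY_RUN:-1}

# run is a hypothetical wrapper: in dry-run mode it only prints the command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# upload_assembly is a hypothetical helper: create the target directory,
# upload the jar (-f overwrites an existing copy), then set mode 755 on it.
upload_assembly() {
    jar=$1
    dir=$2
    run hadoop fs -mkdir -p "$dir" &&
    run hadoop fs -put -f "$jar" "$dir" &&
    run hadoop fs -chmod 755 "$dir/$(basename "$jar")"
}

upload_assembly "$JAR" "$HDFS_DIR"
```

Afterwards, hadoop fs -ls /user/spark/share/lib shows the uploaded file together with its permissions.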

6. Configure in CM

  • Log in to CM and, in the Spark service-wide configuration, set the assembly jar location to the HDFS path of the jar uploaded in step 5
/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar

  • Edit Spark's advanced configuration
spark.yarn.jar=hdfs://bestCluster/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar

export HIVE_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hive/conf

  • Click Save Changes, then deploy the client configuration.

7. Run spark-sql

With the environment variables in place, spark-sql can be run from any directory

[hadoop@db1 ~]$ spark-sql
...
...
16/07/27 16:04:52 INFO metastore: Trying to connect to metastore with URI thrift://nn1.hadoop:9083
16/07/27 16:04:52 INFO metastore: Connected to metastore.
16/07/27 16:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/462a9698-5bb6-4d17-bce3-9e162cfd40f8_resources
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8/_tmp_space.db
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
spark-sql> 
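
Beyond reaching the interactive prompt, a quick non-interactive check is to run a single statement with the -e option. The sketch below assumes the Hive metastore is reachable as in the log above; smoke_test is a hypothetical helper name.

```shell
#!/bin/sh
# smoke_test is a hypothetical helper: run one statement non-interactively
# through spark-sql and report whether the command exited cleanly.
smoke_test() {
    if ! command -v spark-sql >/dev/null 2>&1; then
        echo "spark-sql not found on PATH"
        return 1
    fi
    if spark-sql -e 'show databases;'; then
        echo "spark-sql smoke test passed"
    else
        echo "spark-sql smoke test failed"
        return 1
    fi
}

smoke_test || true
```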

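Tips:

1. Newly created or copied files must be given read and write permissions

2. Before replacing an existing file, check its owner, whether it is a symlink, and similar attributes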
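
Both tips can be covered with standard tools. The following sketch uses a hypothetical helper, describe_file, to record the owner, mode and symlink target of a file before it is replaced.

```shell
#!/bin/sh
# describe_file is a hypothetical helper: print the long listing of a path
# (owner, group, mode) and, for a symlink, the target it points to.
describe_file() {
    ls -ld "$1"
    if [ -L "$1" ]; then
        echo "symlink target: $(readlink "$1")"
    fi
}

# Example call:
# describe_file /opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar
```

For the first tip, chmod 644 on a copied file gives the owner read and write access and everyone else read access.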

That's all!

Permanent link to this article: http://www.linuxidc.com/Linux/2016-08/133847.htm

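Copyright notice: original article on this site, published by 星锅 on 2022-01-21.
Reposting: unless otherwise stated, articles on this site are published under the CC-4.0 license; please credit the source when reposting.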