Spark's Speed Comes at the Cost of Correct Results

Yes, Spark is fast. But it does not guarantee that the values it computes are correct, even when all you are doing is a simple integer sum.

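The best-known Spark paper is "Spark: Cluster Computing with Working Sets". When you read it, keep in mind that the code in the paper is not guaranteed to compute correct results. Specifically, its Logistic Regression code uses an accumulator in the map phase. The rest of this post explains why that is wrong.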

Suppose we have this simple task:

Each line of the input file contains 100 integers, and we want the column-wise sums.

For example:

Input

1 2 3 4 5 … 100
1 2 3 4 5 … 200
1 3 3 4 5 … 100

Output

3 7 9 12 15 … 400

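Simple, right? Anyone could work it out. On Hadoop this problem is solved with MapReduce. First, split the input file into N blocks of equal size. Each block is handled by a mapper, which emits a single line of 100 integers holding the column sums for that block, e.g. 2 4 6 8 10 … 200. A reducer then receives the output of every mapper and adds them up to get the final result.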
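The map-then-reduce flow just described can be sketched in plain Scala, with no Hadoop or Spark involved. This is only an illustration: the names `ColumnSum`, `mapBlock`, `reducePartials` and the `blockSize` parameter are made up for this sketch, and it assumes a non-empty input whose rows all have the same width.

```scala
object ColumnSum {
  // Element-wise sum of two rows of equal width.
  def addRows(a: Vector[Long], b: Vector[Long]): Vector[Long] =
    a.zip(b).map { case (x, y) => x + y }

  // "Map" side: one block of rows becomes a single row of partial column sums.
  def mapBlock(block: Seq[Vector[Long]]): Vector[Long] =
    block.reduce(addRows)

  // "Reduce" side: add up the partial rows emitted by all mappers.
  def reducePartials(partials: Seq[Vector[Long]]): Vector[Long] =
    partials.reduce(addRows)

  // Split the rows into blocks, map each block, then reduce the partial rows.
  def columnSums(rows: Seq[Vector[Long]], blockSize: Int): Vector[Long] =
    reducePartials(rows.grouped(blockSize).map(mapBlock).toVector)

  def main(args: Array[String]): Unit = {
    // First three columns of the example input above.
    val rows = Seq(Vector(1L, 2L, 3L), Vector(1L, 2L, 3L), Vector(1L, 3L, 3L))
    println(columnSums(rows, blockSize = 2).mkString(" ")) // prints "3 7 9"
  }
}
```

Note that every value flows through function inputs and return values, so running a block twice cannot change the answer.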

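The drawback: getting data from the mappers to the reducer requires disk I/O and network transfer, and N * 100 integers have to be shipped. When the input is high-dimensional (each line runs to millions of bytes), that is wasteful.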

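Spark cleverly introduced the concept of an accumulator. The outputs of all tasks on the same machine are first aggregated locally on that machine and only then sent to the reducer. The transfer cost is then no longer (number of tasks) * (dimensions) but (number of machines) * (dimensions), which saves a good deal. In machine learning work in particular, people routinely use accumulators for this kind of computation.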

Accumulators were designed with care. For example, only the master node can read an accumulator's value; worker nodes cannot. In the paper "Performance and Scalability of Broadcast in Spark", the authors write: "Accumulators can be defined for any type that has an "add" operation and a "zero" value. Due to their "add-only" semantics, they are easy to make fault-tolerant." But is that really so? It is not.

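If an accumulator is updated anywhere other than the final step of a computation, its correctness cannot be guaranteed. The reason is that an accumulator is neither an input nor an output of a map/reduce function; it is a side effect of evaluating an expression. For example: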

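val acc = sc.accumulator(0)
// The closure needs braces; keep a handle on the mapped RDD so the actions below actually run it.
val mapped = data.map { x => acc += 1; f(x) }
mapped.count()
// acc should equal data.count() here
mapped.foreach { ... }
// Now acc = 2 * data.count(), because mapped is not cached and the map() was recomputed.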
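To see the double counting without a cluster, here is a toy model in plain Scala. `LazyData` is a hypothetical stand-in for an uncached RDD, not a Spark class: every action re-runs the whole closure chain, and a local `var` plays the role of the accumulator.

```scala
object RecomputeDemo {
  // Toy stand-in for an uncached RDD: each action recomputes from the source.
  final class LazyData[A](compute: () => Seq[A]) {
    def map[B](f: A => B): LazyData[B] = new LazyData[B](() => compute().map(f))
    def count(): Long = compute().size.toLong
    def foreach(f: A => Unit): Unit = compute().foreach(f)
  }

  // Returns the counter value after the first action and after the second.
  def run(n: Int): (Long, Long) = {
    var acc = 0L // plays the role of the accumulator
    val data = new LazyData[Int](() => 1 to n)
    val mapped = data.map { x => acc += 1; x * 2 }
    mapped.count()          // first action: the map closure runs n times
    val afterCount = acc
    mapped.foreach(_ => ()) // second action: the map closure runs n more times
    (afterCount, acc)
  }

  def main(args: Array[String]): Unit =
    println(run(5)) // prints "(5,10)"
}
```

The values returned by the map are identical on both runs; only the side effect is counted twice.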

Matei, the creator of Spark, marked this issue as Won't Fix.

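So is it enough to write code carefully and avoid triggering recomputation? No. A task may fail and be retried, or a slow task may have several copies of itself running at the same time. Either can leave the accumulator with a wrong value. Accumulators should therefore only be updated inside RDD actions, not inside transformations. For example, you can use one in a reduce function, but not in a map function.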
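The retry and duplicate-copy cases can be modeled the same way. The sketch below is a simplification and not Spark's scheduler: `attempts(i)` is a hypothetical knob saying how many times partition i gets executed (each entry is assumed to be at least 1), and only one attempt's return value is kept per partition.

```scala
object RetryDemo {
  // Returns (sum collected through the side effect, sum of the kept task results).
  def run(partitions: Seq[Seq[Int]], attempts: Seq[Int]): (Long, Long) = {
    var sideEffectSum = 0L // updated by every attempt, like an accumulator in map()
    val results = partitions.zip(attempts).map { case (part, n) =>
      val outputs = (1 to n).map { _ =>
        part.foreach(x => sideEffectSum += x) // side effect happens on each attempt
        part.map(_.toLong).sum                // return value of this attempt
      }
      outputs.head // the scheduler keeps a single result per partition
    }
    (sideEffectSum, results.sum)
  }

  def main(args: Array[String]): Unit = {
    // The second partition is executed twice (a retry or a duplicate copy).
    println(run(Seq(Seq(1, 2), Seq(3, 4)), Seq(1, 2))) // prints "(17,10)"
  }
}
```

The sum built from return values is 10 no matter how many attempts run, while the side-effect sum grows with every extra attempt.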

If you do not want to use accumulators but still want to cut network traffic, Matei's advice is: "I would suggest creating fewer tasks. If your input file has a lot of blocks and hence a lot of parallel tasks, you can use CoalescedRDD to create an RDD with fewer blocks from it."

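In other words, make each task larger and reduce the number of tasks, for example to a single task per machine. The downside is just as obvious: the workload is more likely to end up unbalanced.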
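A rough sketch of that trade-off in plain Scala follows. The `coalesce` helper here is hypothetical: it simply merges consecutive blocks and is not how Spark's own coalescing picks partitions. The per-block costs are made-up numbers, and the helper assumes a non-empty block list and `numTasks` of at least 1.

```scala
object CoalesceSketch {
  // Merge consecutive blocks into at most numTasks groups, one group per task.
  def coalesce[A](blocks: Seq[A], numTasks: Int): Vector[Seq[A]] = {
    val perTask = math.ceil(blocks.size.toDouble / numTasks).toInt
    blocks.grouped(perTask).toVector
  }

  // Integers shipped to the reducer: one row of `dims` partial sums per task.
  def shipped(numTasks: Int, dims: Int): Long = numTasks.toLong * dims

  def main(args: Array[String]): Unit = {
    val blockCosts = Seq(5, 1, 1, 1, 1, 1, 1, 1) // made-up processing cost per block
    val tasks = coalesce(blockCosts, numTasks = 2)
    println(shipped(blockCosts.size, 100))     // prints "800": one task per block
    println(shipped(tasks.size, 100))          // prints "200": two coalesced tasks
    println(tasks.map(_.sum).mkString(" vs ")) // prints "8 vs 4": uneven load per task
  }
}
```

Fewer tasks mean fewer partial rows on the wire, but one expensive block can make a single task take twice as long as the other.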

References:
https://issues.apache.org/jira/browse/SPARK-732
https://issues.apache.org/jira/browse/SPARK-3628
https://issues.apache.org/jira/browse/SPARK-5490

https://github.com/apache/spark/pull/228
