Spark Source Code Walkthrough

RDD stands for Resilient Distributed Datasets. It is the core abstraction in Spark.

An RDD is a read-only, immutable dataset with good fault tolerance. It has five main properties:

- A list of partitions: the data must be splittable into partitions before it can be processed in parallel.

- A function for computing each split: one function computes one partition.

- A list of dependencies on other RDDs.

- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).

- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file), that is, the best place to compute each partition.

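RDDs are the foundation on which all Spark components run. An RDD is a fault-tolerant, parallel data structure that offers a rich set of data operations and APIs.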

[Figure: the RDD API in Spark]

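An RDD can contain multiple partitions, and each partition is a fragment of the dataset. RDDs can depend on one another.

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.

Wide dependency: several partitions of the child RDD depend on the same partition of the parent RDD.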
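
The two definitions above can be checked mechanically. The following is a minimal sketch in plain Scala, not Spark code: `childToParents` is a hypothetical model in which entry `i` lists the parent partition indices that child partition `i` reads.

```scala
object DependencyKindSketch {
  // Hypothetical model: childToParents(i) lists the parent partition
  // indices that child partition i reads from.
  def isNarrow(childToParents: Seq[Seq[Int]]): Boolean = {
    // Count how many distinct child partitions use each parent partition.
    val usage = childToParents
      .flatMap(_.distinct)
      .groupBy(identity)
      .map { case (parent, uses) => parent -> uses.size }
    // Narrow means no parent partition is shared by several children.
    usage.values.forall(_ <= 1)
  }

  // map-like layout: child i reads only parent i.
  val mapLike: Seq[Seq[Int]] = Seq(Seq(0), Seq(1), Seq(2))

  // shuffle-like layout: every child reads every parent.
  val shuffleLike: Seq[Seq[Int]] = Seq(Seq(0, 1, 2), Seq(0, 1, 2))
}
```

Here `isNarrow(mapLike)` is true and `isNarrow(shuffleLike)` is false, which matches the intuition that a shuffle introduces a wide dependency.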

[Figure: RDD overview]

The class org.apache.spark.rdd.RDD has several important methods:

1)

/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getPartitions: Array[Partition]

This method returns the RDD's partitions as an array.

2)

/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps

It returns the dependencies as a Seq.

3)

/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

Every RDD has its own concrete compute function.

4)

/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

It returns the preferred locations for a partition, which the scheduler uses to run the computation close to the data.

RDD operations fall into two main categories:

Transformations

map(f : T => U) : RDD[T] => RDD[U]
filter(f : T => Bool) : RDD[T] => RDD[T]
flatMap(f : T => Seq[U]) : RDD[T] => RDD[U]
sample(fraction : Float) : RDD[T] => RDD[T] (Deterministic sampling)
groupByKey() : RDD[(K, V)] => RDD[(K, Seq[V])]
reduceByKey(f : (V, V) => V) : RDD[(K, V)] => RDD[(K, V)]
union() : (RDD[T], RDD[T]) => RDD[T]
join() : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
cogroup() : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
crossProduct() : (RDD[T], RDD[U]) => RDD[(T, U)]
mapValues(f : V => W) : RDD[(K, V)] => RDD[(K, W)] (Preserves partitioning)
sort(c : Comparator[K]) : RDD[(K, V)] => RDD[(K, V)]
partitionBy(p : Partitioner[K]) : RDD[(K, V)] => RDD[(K, V)]

Actions

count() : RDD[T] => Long
collect() : RDD[T] => Seq[T]
reduce(f : (T, T) => T) : RDD[T] => T
lookup(k : K) : RDD[(K, V)] => Seq[V] (On hash/range partitioned RDDs)
save(path : String) : Outputs RDD to a storage system, e.g., HDFS
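
The distinction matters because transformations are lazy and only describe a new dataset, while actions force the computation. As an analogy only, the sketch below uses a plain Scala `Iterator` (not an RDD) whose `map` is also lazy; the `calls` counter is a hypothetical probe added to make the evaluation order visible.

```scala
object LazyTransformSketch {
  // Returns (calls before the "action", calls after it, result of the "action").
  def run(): (Int, Int, Int) = {
    var calls = 0
    val data = Iterator(1, 2, 3, 4)
    // Like a transformation: the function is recorded but not applied yet.
    val transformed = data.map { x => calls += 1; x * 2 }
    val callsBeforeAction = calls
    // Like an action: consuming the iterator finally applies the function.
    val total = transformed.sum
    (callsBeforeAction, calls, total)
  }
}
```

`run()` returns `(0, 4, 20)`: the mapping function is not called until the sum is requested.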

First, the Transformations part of the source:

// Transformations (return a new RDD)

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
  new FlatMappedRDD(this, sc.clean(f))

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

……

map

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

map returns a MappedRDD, which extends RDD and overrides two methods, getPartitions and compute.

The first, getPartitions, takes the first parent RDD and returns its partition array:

override def getPartitions: Array[Partition] = firstParent[T].partitions

The second, compute, iterates over the parent's partition and applies the function passed to map to each element:

override def compute(split: Partition, context: TaskContext) =
  firstParent[T].iterator(split, context).map(f)
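
To make the delegation to the parent concrete, here is a toy model in plain Scala. It is a sketch under simplifying assumptions, not Spark's implementation: `ToyPartition`, `ToyRDD`, `ToyCollectionRDD`, and `ToyMappedRDD` are hypothetical names, and caching, `TaskContext`, and closure cleaning are left out.

```scala
object ToyRddSketch {
  // Toy stand-in for Spark's Partition; it only carries an index.
  final case class ToyPartition(index: Int)

  abstract class ToyRDD[T] {
    def partitions: Array[ToyPartition]
    def compute(split: ToyPartition): Iterator[T]

    // Transformation: wraps this RDD, nothing is computed yet.
    def map[U](f: T => U): ToyRDD[U] = new ToyMappedRDD(this, f)

    // Action: pulls the iterator of every partition.
    def collect(): Seq[T] = partitions.toList.flatMap(p => compute(p).toList)
  }

  // Source RDD backed by local slices, one slice per partition.
  class ToyCollectionRDD[T](slices: Seq[Seq[T]]) extends ToyRDD[T] {
    def partitions: Array[ToyPartition] =
      slices.indices.map(i => ToyPartition(i)).toArray
    def compute(split: ToyPartition): Iterator[T] = slices(split.index).iterator
  }

  // Same shape as MappedRDD: reuse the parent's partitions and
  // map over the parent's iterator for the requested partition.
  class ToyMappedRDD[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
    def partitions: Array[ToyPartition] = parent.partitions
    def compute(split: ToyPartition): Iterator[U] = parent.compute(split).map(f)
  }
}
```

With `new ToyCollectionRDD(Seq(Seq(1, 2), Seq(3))).map(_ * 10)`, the mapped RDD keeps two partitions and `collect()` yields 10, 20, 30.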

filter

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

filter keeps only the elements that satisfy the predicate, for example mapRDD.filter(_ > 1).

union

/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T] = new UnionRDD(sc, Array(this, other))

union combines several RDDs into one new RDD. UnionRDD overrides five RDD methods: getPartitions, getDependencies, compute, getPreferredLocations, and clearDependencies.

getPartitions and getDependencies show that it uses a set of narrow dependencies: each parent contributes a RangeDependency, which is a NarrowDependency, so no shuffle is involved.

override def getDependencies: Seq[Dependency[_]] = {
  val deps = new ArrayBuffer[Dependency[_]]
  var pos = 0
  for (rdd <- rdds) {
    deps += new RangeDependency(rdd, 0, pos, rdd.partitions.size)
    pos += rdd.partitions.size
  }
  deps
}
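
The range arithmetic in that loop is easier to follow on a small example. The sketch below is plain Scala with a hypothetical `ToyRange` case class that mirrors the four constructor arguments of RangeDependency (parent, inStart, outStart, length); the `getParents` logic is my reading of how such a range maps a union partition back to a parent partition.

```scala
object RangeDependencySketch {
  // Toy analogue of RangeDependency(rdd, inStart, outStart, length).
  final case class ToyRange(parentId: Int, inStart: Int, outStart: Int, length: Int) {
    // Parent partition read by the given union partition, or Nil if the
    // union partition lies outside this range.
    def getParents(partitionId: Int): List[Int] =
      if (partitionId >= outStart && partitionId < outStart + length)
        List(partitionId - outStart + inStart)
      else Nil
  }

  // One range per parent; outStart is the running total of partition counts,
  // which plays the role of `pos` in getDependencies.
  def ranges(parentSizes: Seq[Int]): Seq[ToyRange] = {
    val starts = parentSizes.scanLeft(0)(_ + _)
    parentSizes.zipWithIndex.map { case (size, id) =>
      ToyRange(id, 0, starts(id), size)
    }
  }
}
```

For parents with 2 and 3 partitions, union partition 3 maps to partition 1 of the second parent and to nothing in the first, so every union partition has exactly one parent partition.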

groupBy

/**
 * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
 * mapping to that key.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
  groupBy[K](f, defaultPartitioner(this))

groupBy groups the elements by the key that f computes, which again produces a new RDD.
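
The performance note in the comment can be illustrated on local collections. This is only a sketch with plain Scala collections and made-up data, not a Spark benchmark: the first function materializes every value per key before summing, while the second keeps one running total per key, which is the idea behind reduceByKey.

```scala
object GroupVsReduceSketch {
  // Made-up sample data for illustration.
  val pairs: Seq[(String, Int)] = Seq("a" -> 1, "b" -> 2, "a" -> 3)

  // groupBy style: collect all values of a key, then aggregate them.
  def sumViaGroup(data: Seq[(String, Int)]): Map[String, Int] =
    data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // reduceByKey style: fold each value into a running total per key.
  def sumViaReduce(data: Seq[(String, Int)]): Map[String, Int] =
    data.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
}
```

Both functions return `Map("a" -> 4, "b" -> 2)` for `pairs`; they differ only in how much intermediate data they hold.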

Action

count

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Following the code, runJob calls dagScheduler.runJob. DAGScheduler submits the job to the job scheduler and gets back a JobWaiter object, which can be used to block until the job finishes executing or to cancel the job.
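
The shape of count (run a function over each partition, then sum the per-partition results) can be sketched without Spark. In the toy below, partitions are plain `Seq`s and `runJob` and `iteratorSize` are local stand-ins I wrote for illustration; they are not the real `SparkContext.runJob` or `Utils.getIteratorSize`.

```scala
object CountSketch {
  // Toy runJob: apply func to each partition's iterator and
  // return one result per partition.
  def runJob[T, U](partitions: Seq[Seq[T]], func: Iterator[T] => U): Seq[U] =
    partitions.map(p => func(p.iterator))

  // Stand-in for an iterator-size helper: walk the iterator and count.
  def iteratorSize[T](it: Iterator[T]): Long = {
    var n = 0L
    while (it.hasNext) { it.next(); n += 1 }
    n
  }

  // count = sum of the per-partition sizes.
  def count[T](partitions: Seq[Seq[T]]): Long =
    runJob[T, Long](partitions, it => iteratorSize(it)).sum
}
```

For three partitions holding 3, 0, and 1 elements, `count` returns 4.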

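[Figure: job scheduling flow from RDD objects through DAGScheduler and TaskScheduler to Executors]

In this figure, the RDD objects build a DAG, which then enters the DAGScheduler phase: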

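1. DAGScheduler is the high-level, stage-oriented scheduler. It splits the DAG into many tasks, and each group of tasks is a stage in the figure.

2. Every shuffle produces a new stage. DAGScheduler keeps track of which RDDs have been materialized to disk, and to get the best task placement it first looks for tasks that can run locally to their data.

3. DAGScheduler also monitors tasks that fail because of shuffle problems and restarts them if necessary.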
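
The rule that each shuffle starts a new stage can be sketched for the simplest case, a linear pipeline. This is a simplification in plain Scala, not the DAGScheduler algorithm, which walks a general dependency graph; `Step` and `splitIntoStages` are hypothetical names.

```scala
object StageSplitSketch {
  // Hypothetical pipeline step: its name and whether it needs a shuffle.
  final case class Step(name: String, shuffle: Boolean)

  // Walk the pipeline in order and open a new stage at every shuffle step;
  // all other steps are appended to the current stage.
  def splitIntoStages(steps: Seq[Step]): Seq[Seq[String]] =
    steps.foldLeft(Vector.empty[Vector[String]]) { (stages, step) =>
      if (stages.isEmpty || step.shuffle) stages :+ Vector(step.name)
      else stages.init :+ (stages.last :+ step.name)
    }
}
```

For the steps textFile, flatMap, map, reduceByKey, with only reduceByKey marked as a shuffle, the result is two stages: (textFile, flatMap, map) and (reduceByKey).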

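After DAGScheduler has split the job into stages, it submits tasks to TaskScheduler one TaskSet at a time:

1. A TaskScheduler serves exactly one SparkContext.

2. When it receives a TaskSet, it submits the tasks to Executors on the Worker nodes to run. TaskScheduler monitors failed tasks and restarts them.

An Executor runs in a multi-threaded fashion, with each thread responsible for one task.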

Next, trace the source of an example that ships with Spark, org.apache.spark.examples.SparkPi:

import scala.math.random // provides random used below

def main(args: Array[String]) {
  // Set the application name (shown in the Web UI)
  val conf = new SparkConf().setAppName("Spark Pi")
  // Create a SparkContext
  val spark = new SparkContext(conf)
  // Number of slices (partitions) and number of samples
  val slices = if (args.length > 0) args(0).toInt else 2
  val n = 100000 * slices
  val count = spark.parallelize(1 to n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x*x + y*y < 1) 1 else 0
  }.reduce(_ + _)
  println("Pi is roughly " + 4.0 * count / n)
  spark.stop()
}

parallelize in this code distributes a local collection across slices, and it does so lazily. Tracing its source:

/** Distribute a local Scala collection to form an RDD.
 *
 * @note Parallelize acts lazily. If `seq` is a mutable collection and is
 * altered after the call to parallelize and before the first action on the
 * RDD, the resultant RDD will reflect the modified collection. Pass a copy of
 * the argument to avoid this.
 */
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

The example then calls map on the resulting RDD. As described above, map is a transformation that produces a new RDD. Finally, reduce triggers the computation.
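
The same Monte Carlo estimate can be tried without a cluster. The sketch below is plain Scala, not the SparkPi source: each slice is processed in a local loop instead of a task, and the fixed `seed` parameter is my addition so that runs are repeatable.

```scala
import scala.util.Random

object LocalPiSketch {
  // Estimate Pi by sampling points in the square [-1, 1] x [-1, 1] and
  // counting how many land inside the unit circle.
  def estimate(slices: Int, perSlice: Int, seed: Long): Double = {
    val n = slices * perSlice
    // One hit count per slice, like one partial result per partition.
    val hitsPerSlice = (0 until slices).map { s =>
      val rng = new Random(seed + s)
      (1 to perSlice).count { _ =>
        val x = rng.nextDouble() * 2 - 1
        val y = rng.nextDouble() * 2 - 1
        x * x + y * y < 1
      }
    }
    // Same final step as reduce(_ + _) followed by 4.0 * count / n.
    4.0 * hitsPerSlice.sum / n
  }
}
```

Calling `LocalPiSketch.estimate(2, 100000, 42L)` uses the same sample count as SparkPi with two slices and gives a value close to Pi.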

A word-count example in the shell:

scala> val rdd = sc.textFile("hdfs://192.168.0.245:8020/test/README.md")

14/12/18 01:12:26 INFO storage.MemoryStore: ensureFreeSpace(82180) called with curMem=331133, maxMem=280248975

14/12/18 01:12:26 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 80.3 KB, free 266.9 MB)

rdd: org.apache.spark.rdd.RDD[String] = hdfs://192.168.0.245:8020/test/README.md MappedRDD[7] at textFile at <console>:12

scala> rdd.toDebugString

14/12/18 01:12:29 INFO mapred.FileInputFormat: Total input paths to process : 1

res3: String =

(1) hdfs://192.168.0.245:8020/test/README.md MappedRDD[7] at textFile at <console>:12

| hdfs://192.168.0.245:8020/test/README.md HadoopRDD[6] at textFile at <console>:12

sc reads the data from HDFS, so toDebugString shows a HadoopRDD at the bottom of the lineage, wrapped by a MappedRDD.

scala> val result = rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect

14/12/18 01:14:51 INFO spark.SparkContext: Starting job: collect at <console>:14

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Registering RDD 9 (map at <console>:14)

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:14) with 1 output partitions (allowLocal=false)

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at <console>:14)

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Submitting Stage 1 (MappedRDD[9] at map at <console>:14), which has no missing parents

14/12/18 01:14:51 INFO storage.MemoryStore: ensureFreeSpace(3440) called with curMem=413313, maxMem=280248975

14/12/18 01:14:51 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.4 KB, free 266.9 MB)

14/12/18 01:14:51 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[9] at map at <console>:14)

14/12/18 01:14:51 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks

14/12/18 01:14:51 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, ANY, 1185 bytes)

14/12/18 01:14:51 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 0)

14/12/18 01:14:51 INFO rdd.HadoopRDD: Input split: hdfs://192.168.0.245:8020/test/README.md:0+4811

14/12/18 01:14:51 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

14/12/18 01:14:51 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

14/12/18 01:14:51 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

14/12/18 01:14:51 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

14/12/18 01:14:51 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

14/12/18 01:14:52 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 0). 1860 bytes result sent to driver

14/12/18 01:14:53 INFO scheduler.DAGScheduler: Stage 1 (map at <console>:14) finished in 1.450 s

14/12/18 01:14:53 INFO scheduler.DAGScheduler: looking for newly runnable stages

14/12/18 01:14:53 INFO scheduler.DAGScheduler: running: Set()

14/12/18 01:14:53 INFO scheduler.DAGScheduler: waiting: Set(Stage 0)

14/12/18 01:14:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 1419 ms on localhost (1/1)

14/12/18 01:14:53 INFO scheduler.DAGScheduler: failed: Set()

14/12/18 01:14:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

14/12/18 01:14:53 INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()

14/12/18 01:14:53 INFO scheduler.DAGScheduler: Submitting Stage 0 (ShuffledRDD[10] at reduceByKey at <console>:14), which is now runnable

14/12/18 01:14:53 INFO storage.MemoryStore: ensureFreeSpace(2112) called with curMem=416753, maxMem=280248975

14/12/18 01:14:53 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.1 KB, free 266.9 MB)

14/12/18 01:14:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (ShuffledRDD[10] at reduceByKey at <console>:14)

14/12/18 01:14:53 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks

14/12/18 01:14:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 948 bytes)

14/12/18 01:14:53 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 1)

14/12/18 01:14:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329

14/12/18 01:14:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks

14/12/18 01:14:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 5 ms

14/12/18 01:14:53 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 1). 8680 bytes result sent to driver

14/12/18 01:14:53 INFO scheduler.DAGScheduler: Stage 0 (collect at <console>:14) finished in 0.108 s

14/12/18 01:14:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 1) in 99 ms on localhost (1/1)

14/12/18 01:14:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

14/12/18 01:14:53 INFO spark.SparkContext: Job finished: collect at <console>:14, took 1.884598939 s

result: Array[(String, Int)] = Array((For,5), (Programs,1), (gladly,1), (Because,1), (The,1), (agree,1), (cluster.,1), (webpage,1), (its,1), (-Pyarn,3), (under,2), (legal,1), (APIs,1), (1.x,,1), (computation,1), (Try,1), (MRv1,,1), (have,2), (Thrift,2), (add,2), (through,1), (several,1), (This,2), (Whether,1), (“yarn-cluster”,1), (%,2), (graph,1), (storage,1), (To,2), (setting,2), (any,2), (Once,1), (application,1), (JDBC,3), (use:,1), (prefer,1), (SparkPi,2), (engine,1), (version,3), (file,1), (documentation,,1), (processing,,2), (Along,1), (the,28), (explicitly,,1), (entry,1), (author.,1), (are,2), (systems.,1), (params,1), (not,2), (different,1), (refer,1), (Interactive,2), (given.,1), (if,5), (`-Pyarn`:,1), (build,3), (when,3), (be,2), (Tests,1), (file’s,1), (Apache,6), (./bin/run-e…

This is the count for each word after splitting the text on spaces.
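
The same flatMap, map, and per-key reduce chain can be reproduced on a local collection to check the logic on small input. This is a sketch with plain Scala collections, where `groupBy` stands in for the shuffle that reduceByKey performs; it is not how Spark executes the job.

```scala
object LocalWordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq) // flatMap: one line becomes many words
      .map(word => (word, 1))      // map: each word becomes (word, 1)
      .groupBy(_._1)               // stand-in for the shuffle by key
      .map { case (word, ones) => word -> ones.map(_._2).sum } // reduce with _ + _
}
```

For the lines "a b a" and "b c", `wordCount` returns `Map("a" -> 2, "b" -> 2, "c" -> 1)`.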


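This part covers the source for the job submission process. SparkSubmit lives in org.apache.spark.deploy, and submission runs as a separate process. Start with its main method:

[Figure: source of the SparkSubmit main method]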

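createLaunchEnv sets a number of configuration values, such as the return values, the cluster mode, and the runtime environment. The focus here is on the Client in cluster mode. The sequence diagram for job submission:

[Figure: job submission sequence diagram]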

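Client starts in its preStart method.

[Figure: source of Client.preStart]

Client is an actor. To submit a job, it first builds a DriverDescription, which includes the jar file URL, memory, CPU cores, and so on. It then sends a RequestSubmitDriver message to the Master.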

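How the Master handles the RequestSubmitDriver message:

[Figure: source of the Master's RequestSubmitDriver handler]

The method to look at here is schedule:

[Figure: source of the schedule method]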

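In the source above, the two methods to focus on are launchDriver and launchExecutor.

launchDriver asks a Worker to start the driver:

[Figure: source of launchDriver]

launchExecutor:

[Figure: source of launchExecutor]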
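
As a rough illustration of the kind of placement decision that leads to a LaunchDriver message, the sketch below assigns waiting drivers to workers with enough free resources. Everything in it is an assumption made for illustration: the case classes are simplified stand-ins, and the first-fit policy is my choice, not a transcription of the Master's schedule method.

```scala
object ToyMasterSketch {
  // Hypothetical, simplified records; not Spark's real classes.
  final case class Worker(id: String, freeCores: Int, freeMemMb: Int)
  final case class Driver(id: String, cores: Int, memMb: Int)
  final case class LaunchDriver(driverId: String, workerId: String)

  // First-fit placement (an assumed policy): give each waiting driver to the
  // first worker that still has enough cores and memory, then deduct them.
  def scheduleDrivers(workers: Seq[Worker], waiting: Seq[Driver]): Seq[LaunchDriver] = {
    var pool = workers.toVector
    waiting.flatMap { d =>
      val idx = pool.indexWhere(w => w.freeCores >= d.cores && w.freeMemMb >= d.memMb)
      if (idx < 0) None // no worker fits; the driver keeps waiting
      else {
        val w = pool(idx)
        pool = pool.updated(
          idx,
          w.copy(freeCores = w.freeCores - d.cores, freeMemMb = w.freeMemMb - d.memMb))
        Some(LaunchDriver(d.id, w.id))
      }
    }
  }
}
```

With workers w1 (1 core, 512 MB) and w2 (4 cores, 4096 MB), a driver needing 2 cores goes to w2, a driver needing 1 core and 512 MB goes to w1, and a driver needing 8 cores is not placed.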

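The Master sends LaunchDriver and LaunchExecutor to the Worker. The next step is to trace how the Worker handles these two messages.

LaunchDriver starts the driver:

[Figure: source of the Worker's LaunchDriver handler]

This starts the driver. On startup it creates a directory, downloads the jar, loads some parameters, and finally sends worker ! DriverStateChanged(driverId, state, finalException) to the Worker. When the Worker receives DriverStateChanged, it forwards the message to the Master. When the Master finally receives this message, it removes the driver.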

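LaunchExecutor

The Worker creates an ExecutorRunner thread, and ExecutorRunner starts the ExecutorBackend process.

[Figure: source of the Worker's LaunchExecutor handler]

The method that does the real work here is fetchAndRunExecutor in ExecutorRunner.

The following flowchart summarizes the job submission process.

[Figure: job submission flowchart]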

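1) After the client starts, it runs the user program directly and brings up the Driver-side components, such as DAGScheduler and BlockManagerMaster.

2) The Driver on the client registers with the Master.

3) The Master tells the Workers to start Executors.

4) Each Worker creates an ExecutorRunner thread, and ExecutorRunner starts the ExecutorBackend process. Once started, ExecutorBackend registers with the Driver's SchedulerBackend.

5) The Driver's DAGScheduler parses the job and generates the corresponding Stages. The Tasks in each Stage are assigned to Executors by TaskScheduler. The job ends when all stages have completed.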