Spark Source Code Analysis -- Stage

2021-11-08 04:01:12

The key to understanding a stage is understanding <code>narrow dependency</code> and <code>wide dependency</code>, though these two concepts can still feel hard to grasp.

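What matters is whether a shuffle is needed. Work that needs no shuffle can run in parallel freely, so a stage boundary falls exactly where a shuffle is required, as the figure below makes clear.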
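The boundary rule can be sketched with a toy model. This is not Spark code: `split_into_stages` and the `(operation, needs_shuffle)` pairs are hypothetical names made up for illustration, and the lineage is assumed to be a simple linear chain.

```python
def split_into_stages(lineage):
    """Split a linear lineage into stages.

    `lineage` is a list of (operation, needs_shuffle) pairs in execution
    order. A new stage starts at every operation that needs a shuffle.
    (Hypothetical helper for illustration, not a Spark API.)
    """
    stages, current = [], []
    for operation, needs_shuffle in lineage:
        if needs_shuffle and current:
            # A shuffle is required here, so close the current stage.
            stages.append(current)
            current = []
        current.append(operation)
    if current:
        stages.append(current)
    return stages


lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),   # wide dependency: needs a shuffle
    ("filter", False),
    ("sortByKey", True),     # wide dependency: needs a shuffle
]
stages = split_into_stages(lineage)
```

In this sketch the six operations end up in three stages, and everything inside one stage can be pipelined partition by partition without moving data between partitions.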

Stages also come in two kinds:

shuffle map stage, in which case its tasks' results are input for another stage

In other words, this is a non-final stage. Other stages follow it, so its output must be shuffled and used as input for those later stages.

result stage, in which case its tasks directly compute the action that initiated a job (e.g. count(), save(), etc)

This is the final stage. It produces no output for another stage; it directly computes the result or writes it to storage.
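The difference between the two kinds of stage can be illustrated with a toy word count. This is only a sketch under stated assumptions: `shuffle_map_task` and `result_task` are hypothetical functions, the "shuffle" is an in-memory list, and the partitioner is a simple character-code sum chosen so the example is deterministic.

```python
from collections import defaultdict


def shuffle_map_task(partition, num_reducers):
    """Toy shuffle map task: its output is bucketed by key so that a
    later stage can read it. (Hypothetical, not a Spark API.)"""
    buckets = [defaultdict(int) for _ in range(num_reducers)]
    for word in partition:
        bucket_id = sum(map(ord, word)) % num_reducers  # toy partitioner
        buckets[bucket_id][word] += 1
    return [dict(bucket) for bucket in buckets]


def result_task(reducer_id, map_outputs):
    """Toy result task: reads its bucket from every map output and
    returns the answer directly to the caller (the 'driver')."""
    merged = defaultdict(int)
    for output in map_outputs:
        for word, count in output[reducer_id].items():
            merged[word] += count
    return dict(merged)


partitions = [["a", "b", "a"], ["b", "c"]]

# Shuffle map stage: one task per input partition.
map_outputs = [shuffle_map_task(p, num_reducers=2) for p in partitions]

# Result stage: one task per reducer, results go straight back.
results = [result_task(r, map_outputs) for r in range(2)]

word_counts = {}
for partial in results:
    word_counts.update(partial)
```

The first stage only produces intermediate, bucketed data; the second stage is the one whose return values are the job's answer.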

The comment in the source code explains this very clearly.

Note that the stage's rdd parameter is a single RDD, the final RDD, rather than a series of RDDs.

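This is because all RDDs within one stage are map-like (narrow) transformations, so the partitions do not change at all; the data simply passes through a sequence of map functions, one after another.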

So from the task scheduler's point of view, the state of one RDD is enough to represent the whole stage.
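A minimal sketch of why one RDD is enough: inside a stage, each task takes one partition and pushes it through the chain of map functions, so the number of partitions (and therefore the number of tasks) is the same for every RDD in the stage. `run_stage` is a hypothetical helper for illustration, not how Spark is implemented.

```python
def run_stage(partitions, map_functions):
    """Run every map function, in order, over each partition.

    One loop iteration over `partitions` stands in for one task.
    (Hypothetical helper for illustration only.)
    """
    outputs = []
    for partition in partitions:
        data = partition
        for fn in map_functions:
            # Each function is applied element by element; no data
            # moves between partitions.
            data = [fn(x) for x in data]
        outputs.append(data)
    return outputs


input_partitions = [[1, 2], [3], [4, 5, 6]]
result = run_stage(input_partitions, [lambda x: x + 1, lambda x: x * 10])
```

The output has exactly as many partitions as the input, which is why the scheduler only needs the final RDD of the stage to decide how many tasks to launch.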

If this is a shuffle map stage, the shuffle needs to be registered with the MapOutputTracker here.
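The idea behind the registration can be sketched as follows. `ToyMapOutputTracker` and its methods are hypothetical and only loosely modeled on the role of Spark's MapOutputTracker: the shuffle is registered up front with the number of map tasks, and each finished map task later records where its output lives.

```python
class ToyMapOutputTracker:
    """Toy stand-in for a map output tracker (not the Spark class)."""

    def __init__(self):
        # shuffle_id -> list with one output location slot per map task
        self._outputs = {}

    def register_shuffle(self, shuffle_id, num_maps):
        """Reserve one empty slot per map task of this shuffle."""
        if shuffle_id in self._outputs:
            raise ValueError(f"shuffle {shuffle_id} registered twice")
        self._outputs[shuffle_id] = [None] * num_maps

    def register_map_output(self, shuffle_id, map_id, location):
        """Record where the output of one finished map task is stored."""
        self._outputs[shuffle_id][map_id] = location

    def is_complete(self, shuffle_id):
        """True once every map task has reported its output location."""
        return all(loc is not None for loc in self._outputs[shuffle_id])


tracker = ToyMapOutputTracker()
tracker.register_shuffle(shuffle_id=0, num_maps=2)
done_before = tracker.is_complete(0)

tracker.register_map_output(0, 0, "host-a")
tracker.register_map_output(0, 1, "host-b")
done_after = tracker.is_complete(0)
```

Registering when the stage is created means the tracker already knows how many map outputs to expect before any task of the stage has run.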

All parent stages can be found by following the deps of the final stage.
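A sketch of that traversal, under simplifying assumptions: `ToyRDD` and `parent_stage_rdds` are hypothetical, each RDD lists its narrow and shuffle dependencies separately, and a parent stage is identified by the name of its last RDD. Narrow deps are followed because they stay inside the current stage; every shuffle dep marks the final RDD of a direct parent stage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRDD:
    """Toy RDD with its dependencies split by kind (illustration only)."""
    name: str
    narrow_deps: list = field(default_factory=list)
    shuffle_deps: list = field(default_factory=list)


def parent_stage_rdds(final_rdd):
    """Return the names of the RDDs that end the direct parent stages."""
    parents, visited, stack = [], set(), [final_rdd]
    while stack:
        rdd = stack.pop()
        if rdd.name in visited:
            continue
        visited.add(rdd.name)
        for parent in rdd.shuffle_deps:
            # Shuffle dependency: the parent RDD belongs to another stage.
            if parent.name not in parents:
                parents.append(parent.name)
        # Narrow dependency: same stage, keep walking upward.
        stack.extend(rdd.narrow_deps)
    return sorted(parents)


a = ToyRDD("A")
b = ToyRDD("B", shuffle_deps=[a])                    # e.g. groupBy
f = ToyRDD("F")
g = ToyRDD("G", narrow_deps=[b], shuffle_deps=[f])   # e.g. join

parents_of_g = parent_stage_rdds(g)
```

Here G and B sit in the same stage, so walking up from G finds two direct parent stages: the one ending at F and the one ending at A. Applying the same walk to each parent would yield the remaining ancestors.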
