Spark Source Code Analysis -- Stage

2021-11-08 04:01:12

The key to understanding a stage is understanding <code>narrow dependency</code> and <code>wide dependency</code>, although these two terms may still feel hard to grasp on their own.

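What matters is whether a shuffle is required. Work that needs no shuffle can run in parallel freely, so the boundary of a stage is exactly the point where a shuffle is needed, as the figure below shows clearly.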
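The rule above can be sketched with a toy lineage model. This is only an illustration, not Spark's real classes: `Node`, `Narrow`, `Wide`, and `sameStage` are hypothetical names introduced here. Starting from the last RDD, the walk follows narrow dependencies and stops at every wide one, which is where the stage ends.

```scala
// Toy model of an RDD lineage; every name here is hypothetical, not Spark's API.
object StageBoundaries {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep // e.g. map, filter: no shuffle
  final case class Wide(parent: Node) extends Dep   // e.g. groupByKey: needs a shuffle

  final case class Node(name: String, deps: List[Dep] = Nil)

  // Names of the RDDs that can be pipelined into the same stage as `last`:
  // follow narrow dependencies, stop at every wide (shuffle) dependency.
  def sameStage(last: Node): List[String] =
    last.name :: last.deps.flatMap {
      case Narrow(p) => sameStage(p)
      case Wide(_)   => Nil // a shuffle is required here, so the stage ends
    }
}
```

For a lineage `lines -> pairs -(shuffle)-> counts -> big`, `sameStage(big)` returns `List("big", "counts")` and `sameStage(pairs)` returns `List("pairs", "lines")`: two stages, split at the shuffle.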

Stages also come in two kinds:

shuffle map stage, in which case its tasks' results are input for another stage

In other words, this is a non-final stage: other stages follow it, so its output must be shuffled and used as the input of a later stage.

result stage, in which case its tasks directly compute the action that initiated a job (e.g. count(), save(), etc)

This is the final stage. It has no output for a later stage; it directly produces the result or writes it to storage.

The comment in the source states this very clearly.
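One way to picture the distinction is to let a stage carry an optional outgoing shuffle. The sketch below is a simplified model under that assumption; `ShuffleDep`, `Stage`, and `kind` are hypothetical names for illustration, not the actual Spark definitions.

```scala
// Simplified model; the names are hypothetical and only illustrate the two kinds.
object StageKinds {
  // Stand-in for the shuffle that consumes a stage's output.
  final case class ShuffleDep(shuffleId: Int)

  // A stage with an outgoing shuffle feeds another stage;
  // a stage without one computes the action's result directly.
  final case class Stage(id: Int, shuffleDep: Option[ShuffleDep]) {
    def isShuffleMap: Boolean = shuffleDep.isDefined
    def kind: String = if (isShuffleMap) "shuffle map stage" else "result stage"
  }
}
```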

Note that the stage's rdd parameter is a single RDD, the final RDD of the stage, rather than a series of RDDs.

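This is because every RDD inside one stage is a map-style (narrow) transformation: the partitioning does not change at all, and the data simply passes through different map functions one after another.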

So from the task scheduler's point of view, the state of one RDD is enough to represent the whole stage.
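A minimal sketch of why one RDD is enough: if each narrow transformation is modeled as a function over a partition's iterator, a stage is just the composition of those functions, applied once per partition. `PartitionFn`, `pipeline`, and `runStage` are hypothetical names, and plain `Vector`s stand in for partitions.

```scala
// Toy pipelining model; names are hypothetical, Vectors stand in for partitions.
object Pipelining {
  // A narrow transformation, seen as a function over one partition's records.
  type PartitionFn = Iterator[Int] => Iterator[Int]

  // Chain the stage's functions in order into a single per-partition function.
  def pipeline(fns: List[PartitionFn]): PartitionFn =
    fns.foldLeft[PartitionFn](it => it)((acc, f) => acc.andThen(f))

  // Run one "task" per partition; the number of partitions never changes.
  def runStage(partitions: Vector[Vector[Int]], fns: List[PartitionFn]): Vector[Vector[Int]] = {
    val task = pipeline(fns)
    partitions.map(p => task(p.iterator).toVector)
  }
}
```

With partitions `Vector(Vector(1, 2), Vector(3))` and the functions "add 1" followed by "keep even values", `runStage` returns `Vector(Vector(2), Vector(4))`: still two partitions, each processed independently.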

If the stage is a shuffle map stage, its shuffle must be registered with the MapOutputTracker at this point.
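The registration step might look roughly like the sketch below. It is a toy under stated assumptions: `ToyMapOutputTracker` is a hypothetical stand-in that only remembers one output slot per map partition, and `newStage` here is an illustrative helper, not Spark's actual method.

```scala
import scala.collection.mutable

// Toy registration flow; all names are hypothetical stand-ins.
object ShuffleRegistration {
  // Remembers, per shuffle id, one output-location slot per map partition.
  final class ToyMapOutputTracker {
    private val outputs = mutable.Map.empty[Int, Array[Option[String]]]

    def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
      require(!outputs.contains(shuffleId), s"shuffle $shuffleId registered twice")
      outputs(shuffleId) = Array.fill[Option[String]](numMaps)(None)
    }

    // Number of map outputs expected for a shuffle, if it was registered.
    def numMaps(shuffleId: Int): Option[Int] = outputs.get(shuffleId).map(_.length)
  }

  final case class Stage(id: Int, numPartitions: Int, shuffleId: Option[Int])

  // Create a stage; only a shuffle map stage (one with a shuffle id) is registered.
  def newStage(id: Int, numPartitions: Int, shuffleId: Option[Int],
               tracker: ToyMapOutputTracker): Stage = {
    shuffleId.foreach(sid => tracker.registerShuffle(sid, numPartitions))
    Stage(id, numPartitions, shuffleId)
  }
}
```

Registering up front means the tracker already knows how many map outputs to expect before any task of the stage has finished.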

All parent stages can be found by following the dependencies (deps) of the final stage.
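The traversal can be sketched as follows, again on a toy model: `Rdd`, `Narrow`, `Shuffle`, `Stage`, `parentStages`, and `newStage` are hypothetical names. Starting from a stage's RDD, a narrow dependency keeps the walk inside the current stage, while a shuffle dependency yields a parent stage, whose own parents are found the same way. Unlike a real scheduler, this sketch does not deduplicate stages that are reachable through more than one shuffle.

```scala
import scala.collection.mutable

// Toy parent-stage discovery; all names are hypothetical.
object ParentStages {
  sealed trait Dep { def parent: Rdd }
  final case class Narrow(parent: Rdd) extends Dep
  final case class Shuffle(parent: Rdd) extends Dep
  final case class Rdd(name: String, deps: List[Dep] = Nil)

  // A stage is represented by its last RDD plus its parent stages.
  final case class Stage(rdd: Rdd, parents: List[Stage])

  def parentStages(rdd: Rdd): List[Stage] = {
    val visited = mutable.Set.empty[String]
    val parents = mutable.ListBuffer.empty[Stage]
    def visit(r: Rdd): Unit =
      if (visited.add(r.name)) r.deps.foreach {
        case Shuffle(p) => parents += newStage(p) // shuffle: a parent stage starts here
        case Narrow(p)  => visit(p)               // narrow: still the same stage
      }
    visit(rdd)
    parents.toList
  }

  // Build the stage ending at `rdd`, including all of its ancestors.
  def newStage(rdd: Rdd): Stage = Stage(rdd, parentStages(rdd))
}
```

For the lineage `a -> b -(shuffle)-> c -> d -(shuffle)-> e -> f`, `newStage(f)` has a single parent stage ending at `d`, which in turn has a single parent stage ending at `b`.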
