PySpark in Practice (3): Analyzing the wordcount Operators

2023-05-27 15:03:31

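At its core, PySpark still calls into Spark's Scala JARs (through Py4J). Taking the wordcount from the previous article as an example, one piece of its code is: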

rdd.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).foreach(lambda x: print(x))

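Here flatMap, map, and reduceByKey are transformation operators.

foreach is an action operator. Adding a transformation to an RDD does not trigger any computation by itself; the chained transformations only run once an action is called.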
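The lazy behavior can be illustrated without a cluster. The following is only an analogy in plain Python, not Spark code: Python 3's built-in `map` is also lazy, so the mapped function (a hypothetical `tag` here) is not called until something consumes the iterator.

```python
# Analogy only: plain Python, no Spark involved.
calls = []  # records every time the mapped function actually runs

def tag(word):
    # Hypothetical helper: logs the call, then builds a (word, 1) pair.
    calls.append(word)
    return (word, 1)

# Building the pipeline runs nothing, much like adding a transformation to an RDD.
pipeline = map(tag, ["a", "b", "a"])
calls_before = list(calls)  # snapshot taken before anything is consumed

# Consuming the iterator plays the role of an action such as foreach or collect.
result = list(pipeline)
calls_after = list(calls)
```

Comparing `calls_before` with `calls_after` shows that `tag` only runs when the iterator is consumed, which is the same ordering Spark follows between transformations and actions.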

Let's locate map and flatMap in rdd.py. The source code is as follows:

def map(self, f, preservesPartitioning=False):
    """
    Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]
    """
    # func receives the partition index (ignored here) and an iterator over the partition
    def func(_, iterator):
        return map(f, iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)

def flatMap(self, f, preservesPartitioning=False):
    """
    Return a new RDD by first applying a function to all elements of this
    RDD, and then flattening the results.

    >>> rdd = sc.parallelize([2, 3, 4])
    >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
    [1, 1, 1, 2, 2, 3]
    >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
    [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
    """
    # chain is itertools.chain; from_iterable flattens the per-element results by one level
    def func(s, iterator):
        return chain.from_iterable(map(f, iterator))
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
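Both methods delegate to mapPartitionsWithIndex, so the inner func runs once per partition and receives an iterator over that partition. A minimal sketch in plain Python, assuming a partitioned RDD can be modeled as a list of lists; `run_per_partition`, `make_map_func`, and `make_flat_map_func` are hypothetical names used for illustration, not PySpark APIs.

```python
from itertools import chain

def run_per_partition(partitions, func):
    """Hypothetical stand-in for mapPartitionsWithIndex: call func(index, iterator)
    on every partition and keep the results grouped per partition."""
    return [list(func(i, iter(part))) for i, part in enumerate(partitions)]

def make_map_func(f):
    # Mirrors the inner func of RDD.map: apply f to each element.
    def func(_, iterator):
        return map(f, iterator)
    return func

def make_flat_map_func(f):
    # Mirrors the inner func of RDD.flatMap: apply f, then flatten one level.
    def func(_, iterator):
        return chain.from_iterable(map(f, iterator))
    return func

partitions = [[2, 3], [4]]  # assumed layout: two partitions

mapped = run_per_partition(partitions, make_map_func(lambda x: (x, 1)))
flat = run_per_partition(partitions, make_flat_map_func(lambda x: range(1, x)))
```

With this layout, `mapped` keeps one output per input inside each partition, while `flat` can hold more or fewer elements than its input because each `range` is unpacked in place.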

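map takes two parameters. The first is f. The second, preservesPartitioning, is a boolean flag rather than a partition count: it indicates whether the function preserves the parent RDD's partitioner, and it should stay False unless this is a pair RDD and f does not modify the keys. So what type is f? The Scala source makes this clearer: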

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

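map is a generic method, and U can be any type. The signature states the type of f explicitly: f: T => U, that is, f is a function that takes a T and returns a U. The operator then returns a new RDD, of type RDD[U].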

/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

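Compared with map, flatMap adds one step. Here f returns a TraversableOnce[U], a collection of U values for each input element, and flatMap flattens all of these collections into a single RDD[U].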
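The two Scala signatures can be restated with Python type hints to make the difference visible. This is a sketch over plain lists; `map_elements` and `flat_map_elements` are hypothetical names chosen for illustration, not PySpark functions.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def map_elements(data: Iterable[T], f: Callable[[T], U]) -> List[U]:
    # f: T => U, exactly one output per input element
    return [f(x) for x in data]

def flat_map_elements(data: Iterable[T], f: Callable[[T], Iterable[U]]) -> List[U]:
    # f: T => TraversableOnce[U], zero or more outputs per input, flattened one level
    return [y for x in data for y in f(x)]

lines = ["a b", "c"]
nested = map_elements(lines, str.split)      # one list of words per line
flat = flat_map_elements(lines, str.split)   # a single list of words
```

The same function `str.split` is passed to both helpers; only the flattening step changes the shape of the result.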

So let's go back to the code:

rdd = sc.textFile(txtfile)

rdd is a collection whose elements are the lines of the text file, similar to an Array[String] with one entry per line.

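rdd.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).foreach(lambda x: print(x))

rdd.flatMap(lambda x: x.split()) first splits each line on whitespace, so each line yields a list of words (strings, not characters). The flattening step then merges these per-line lists into a single collection of words.

map(lambda x: (x, 1)) turns each value into a (key, value) pair, where x is one word from that collection.

reduceByKey(lambda x, y: x + y) aggregates by key: the values that share the same key are added together.

foreach iterates over the elements, here printing each (word, count) pair.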
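The four stages above can be traced by hand on a small input. The sketch below uses plain Python in place of Spark, with a two-line list standing in for the text file; `reduce_by_key` is a hypothetical helper that imitates reduceByKey on a single machine and is not the PySpark implementation.

```python
from itertools import chain

lines = ["hello spark", "hello world"]  # assumed stand-in for sc.textFile(txtfile)

# flatMap(lambda x: x.split()): one flat list of words
words = list(chain.from_iterable(line.split() for line in lines))

# map(lambda x: (x, 1)): one (word, 1) pair per word
pairs = [(w, 1) for w in words]

def reduce_by_key(pairs, f):
    # Hypothetical helper: fold the values that share a key with f.
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

# reduceByKey(lambda x, y: x + y): sum the 1s per word
counts = reduce_by_key(pairs, lambda x, y: x + y)

# foreach(lambda x: print(x)): a side effect per element
for item in counts.items():
    print(item)
```

Inspecting `words`, `pairs`, and `counts` in turn shows what each operator contributes to the final result.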
