
Differences between partition by and group by in Spark SQL, and how to use them

1. Differences and relationship between partition by and group by

1) group by is a grouping operation, while partition by belongs to analytic (window) functions; functions such as sum() are aggregate functions.

2) In execution order, partition by is applied after the other keywords: the partitioning effectively happens over the result set obtained once select has run. group by follows the usual precedence of SQL keywords (from > where > group by > having > order by).

3) Compared with group by, partition by keeps every row of the data while grouping and ordering by selected columns; group by keeps only the grouping columns and the aggregate results (see the sketch right after this list).
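
To make the contrast concrete, here is a minimal sketch in Spark's SQL syntax; the temp view name t and the SparkSession setup are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical view `t` with the same three columns used later in this post
Seq(("ABC", "2019-02-10", "411626"), ("ABC", "2019-02-10", "411627"))
  .toDF("user_id", "start_time", "end_time")
  .createOrReplaceTempView("t")

// group by: one row per group; only grouping columns and aggregates survive
spark.sql("""SELECT user_id, start_time, count(end_time) AS cnt
             FROM t GROUP BY user_id, start_time""").show()

// partition by: every input row survives; the aggregate is attached per row
spark.sql("""SELECT user_id, start_time, end_time,
                    count(end_time) OVER (PARTITION BY user_id, start_time) AS cnt
             FROM t""").show()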

2. Using group by in Spark SQL

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes an active SparkSession named `spark`

// Note: end_time is a string column; see how sum() handles it below
val df = Seq(
  ("ABC", "2019-02-10", "411626"),
  ("ABC", "2019-02-10", "411627"),
  ("BCD", "2020-04-01", "411626"),
  ("BCD", "2020-04-01", "411627"),
  ("BCD", "2020-04-02", "411626"),
  ("BCD", "2020-04-02", "411627"),
  ("DEF", "2019-01-09", "411626"))
  .toDF("user_id", "start_time", "end_time")

// group by: one output row per (user_id, start_time) group
df.groupBy(col("user_id"), col("start_time"))
  .agg(count(col("end_time")), sum(col("end_time")))
  .show()

+-------+----------+---------------+-------------+
|user_id|start_time|count(end_time)|sum(end_time)|
+-------+----------+---------------+-------------+
|    BCD|2020-04-02|              2|     823253.0|
|    ABC|2019-02-10|              2|     823253.0|
|    BCD|2020-04-01|              2|     823253.0|
|    DEF|2019-01-09|              1|     411626.0|
+-------+----------+---------------+-------------+      
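
Because end_time is stored as a string, sum() implicitly casts it to double, which is why the results show 823253.0. A minimal sketch of making the cast explicit; the alias total is an assumption for illustration:

// Explicit cast instead of relying on the implicit string-to-double coercion
df.groupBy(col("user_id"), col("start_time"))
  .agg(sum(col("end_time").cast("long")).as("total"))
  .show()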

3. Using partition by in Spark SQL

import org.apache.spark.sql.expressions.Window

// row_number() assigns a per-partition rank while keeping every input row
df.withColumn("rank", row_number().over(
    Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time"))))
  .show()

+-------+----------+--------+----+
|user_id|start_time|end_time|rank|
+-------+----------+--------+----+
|    BCD|2020-04-02|  411626|   1|
|    BCD|2020-04-02|  411627|   2|
|    ABC|2019-02-10|  411626|   1|
|    ABC|2019-02-10|  411627|   2|
|    BCD|2020-04-01|  411626|   1|
|    BCD|2020-04-01|  411627|   2|
|    DEF|2019-01-09|  411626|   1|
+-------+----------+--------+----+      

partition by returns all data columns of the original rows.
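
Because every row survives, the computed rank can be used for per-group top-N filtering. A minimal sketch, assuming we want only the smallest end_time within each (user_id, start_time) partition:

import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time"))

// keep only the first-ranked row of each partition, then drop the helper column
df.withColumn("rank", row_number().over(w))
  .filter(col("rank") === 1)
  .drop("rank")
  .show()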

4. Returning all data columns with group by

// collect_set gathers the distinct end_time values of each group into a set;
// indexing with (0) then picks one element of that set
df.groupBy(col("user_id"), col("start_time"))
  .agg(count(col("end_time")), sum(col("end_time")), collect_set(col("end_time"))(0).as("end_time"))
  .show()

+-------+----------+---------------+-------------+--------+
|user_id|start_time|count(end_time)|sum(end_time)|end_time|
+-------+----------+---------------+-------------+--------+
|    BCD|2020-04-02|              2|     823253.0|  411627|
|    ABC|2019-02-10|              2|     823253.0|  411627|
|    BCD|2020-04-01|              2|     823253.0|  411627|
|    DEF|2019-01-09|              1|     411626.0|  411626|
+-------+----------+---------------+-------------+--------+      

Using collect_set (which de-duplicates) makes it possible to return columns that are not among the grouping keys. Note that indexing with (0) keeps only one arbitrary element of the set per group, not every row.
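
If every value is needed rather than a single representative, collecting the whole list is an option; a minimal sketch, where the alias end_times is an assumption for illustration:

// collect_list keeps all values (including duplicates) as an array column
df.groupBy(col("user_id"), col("start_time"))
  .agg(collect_list(col("end_time")).as("end_times"))
  .show(truncate = false)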

I don't produce knowledge; I'm just a porter of knowledge.