At work, Spark SQL queries often run slowly or fail outright, and troubleshooting them requires knowing how to read the Spark Web UI. The official documentation is a good starting point: https://spark.apache.org/docs/3.2.1/web-ui.html#content. The Web UI has a number of tabs, and we will work through them one by one.

Today we look at one of the most commonly used tabs: SQL.
SQL Tab
If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, jobs, and physical and logical plans for the queries. Here we include a basic example to illustrate this tab:
scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name")
df: org.apache.spark.sql.DataFrame = [count: int, name: string]
scala> df.count
res0: Long = 3
scala> df.createGlobalTempView("df")
scala> spark.sql("select name,sum(count) from global_temp.df group by name").show
+----+----------+
|name|sum(count)|
+----+----------+
|andy|         3|
| bob|         2|
+----+----------+
Now the above three dataframe/SQL operators are shown in the list. If we click the ‘show at <console>: 24’ link of the last query, we will see the DAG and details of the query execution.
The query details page displays information about the query execution time, its duration, the list of associated jobs, and the query execution DAG. The first block ‘WholeStageCodegen (1)’ compiles multiple operators (‘LocalTableScan’ and ‘HashAggregate’) together into a single Java function to improve performance, and metrics like number of rows and spill size are listed in the block. The annotation ‘(1)’ in the block name is the code generation id. The second block ‘Exchange’ shows the metrics on the shuffle exchange, including number of written shuffle records, total data size, etc.
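You can inspect these codegen boundaries without the UI as well. As a minimal sketch reusing the query above, Dataset.explain("codegen") (available since Spark 3.0) prints each WholeStageCodegen subtree along with the Java source Spark generates for it:
scala> // Print the whole-stage-codegen subtrees and their generated Java code
scala> spark.sql("select name,sum(count) from global_temp.df group by name").explain("codegen")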
Clicking the ‘Details’ link on the bottom displays the logical plans and the physical plan, which illustrate how Spark parses, analyzes, optimizes and performs the query. Steps in the physical plan subject to whole-stage code generation optimization are prefixed by a star followed by the code generation id, for example: ‘*(1) LocalTableScan’.
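The same plans can be printed straight from the shell. Reusing the query above, explain(true) writes the parsed, analyzed, and optimized logical plans followed by the physical plan to stdout (explain("formatted") gives a more readable layout in Spark 3.x):
scala> // Print the parsed/analyzed/optimized logical plans and the physical plan
scala> spark.sql("select name,sum(count) from global_temp.df group by name").explain(true)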
SQL metrics
The metrics of SQL operators are shown in the block of physical operators. The SQL metrics can be useful when we want to dive into the execution details of each operator. For example, “number of output rows” can answer how many rows are output after a Filter operator, “shuffle bytes written total” in an Exchange operator shows the number of bytes written by a shuffle.
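These metrics can also be read programmatically after a query has run. The rough sketch below reuses the query above and walks the executed plan; note that queryExecution and SparkPlan.metrics are developer-facing APIs, and with adaptive query execution (on by default in 3.2) the traversed plan may not reflect every runtime re-optimization, so treat this as illustrative:
scala> val grouped = spark.sql("select name,sum(count) from global_temp.df group by name")
scala> grouped.collect()  // run the query first so its SQL metrics get populated
scala> grouped.queryExecution.executedPlan.foreach { node =>
     |   node.metrics.foreach { case (name, metric) =>
     |     println(s"${node.nodeName}: $name = ${metric.value}")
     |   }
     | }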
Here is the list of SQL metrics:
| SQL metrics | Meaning | Operators |
|---|---|---|
| number of output rows | the number of output rows of the operator | Aggregate operators, Join operators, Sample, Range, Scan operators, Filter, etc. |
| data size | the size of broadcast/shuffled/collected data of the operator | BroadcastExchange, ShuffleExchange, Subquery |
| time to collect | the time spent on collecting data | BroadcastExchange, Subquery |
| scan time | the time spent on scanning data | ColumnarBatchScan, FileSourceScan |
| metadata time | the time spent on getting metadata like number of partitions, number of files | FileSourceScan |
| shuffle bytes written | the number of bytes written | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| shuffle records written | the number of records written | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| shuffle write time | the time spent on shuffle writing | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| remote blocks read | the number of blocks read remotely | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| remote bytes read | the number of bytes read remotely | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| remote bytes read to disk | the number of bytes read from remote to local disk | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| local blocks read | the number of blocks read locally | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| local bytes read | the number of bytes read locally | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| fetch wait time | the time spent on fetching data (local and remote) | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| records read | the number of read records | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| sort time | the time spent on sorting | Sort |
| peak memory | the peak memory usage in the operator | Sort, HashAggregate |
| spill size | number of bytes spilled to disk from memory in the operator | Sort, HashAggregate |
| time in aggregation build | the time spent on aggregation | HashAggregate, ObjectHashAggregate |
| avg hash probe bucket list iters | the average bucket list iterations per lookup during aggregation | HashAggregate |
| data size of build side | the size of built hash map | ShuffledHashJoin |
| time to build hash map | the time spent on building hash map | ShuffledHashJoin |
| task commit time | the time spent on committing the output of a task after the writes succeed | any write operation on a file-based table |
| job commit time | the time spent on committing the output of a job after the writes succeed | any write operation on a file-based table |
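Everything the SQL tab displays is also available over the monitoring REST API; as far as I can tell, the /sql endpoint was added in Spark 3.1. A minimal sketch, assuming the driver UI is listening at its default http://localhost:4040:
scala> import scala.io.Source
scala> val appId = spark.sparkContext.applicationId
scala> // Dump every SQL execution and its metrics for this application as JSON
scala> println(Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/sql").mkString)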