關于citus.limit_clause_row_fetch_count優化參數關于citus.limit_clause_row_fetch_count優化參數

2018-10-03 23:50:00

關于 `citus.limit_clause_row_fetch_count` 優化參數

citus.limit_clause_row_fetch_count

是citus的一個性能優化的參數。具體适應于什麼場景呢？

官方文檔說明

下面是官方文檔說明，還是不夠具體。

https://docs.citusdata.com/en/v7.3/develop/api_guc.html?highlight=limit_clause_row_fetch_count

Planner Configuration

citus.limit_clause_row_fetch_count (integer)

Sets the number of rows to fetch per task for limit clause optimization. 
In some cases, select queries with limit clauses may need to fetch all 
rows from each task to generate results. In those cases, and where an 
approximation would produce meaningful results, this configuration value
 sets the number of rows to fetch from each shard. Limit approximations 
are disabled by default and this parameter is set to -1. This value can 
be set at run-time and is effective on the coordinator.

測試用例

從citus的測試用例中，可以清楚的看到它的作用。

citus-7.2.1\src\test\regress\expected\multi_limit_clause_approximate.out

-- Enable limit optimization to fetch one third of each shard's data
SET citus.limit_clause_row_fetch_count TO 600;
SELECT l_partkey, sum(l_partkey * (1 + l_suppkey)) AS aggregate FROM lineitem
    GROUP BY l_partkey
    ORDER BY aggregate DESC LIMIT 10;
DEBUG:  push down of limit count: 600
 l_partkey | aggregate  
-----------+------------
    194541 | 3727794642
    160895 | 3671463005
    183486 | 3128069328
    179825 | 3093889125
    162432 | 2834113536
    153937 | 2761321906
    199283 | 2726988572
    185925 | 2672114100
    196629 | 2622637602
    157064 | 2614644408
(10 rows)

上面的SQL，如果不加

SET citus.limit_clause_row_fetch_count TO 600

,CN需要到worker上把所有資料都撈出來，然後再在CN上排序取TopN結果。

大資料量的情況，性能會非常糟糕。加上

SET citus.limit_clause_row_fetch_count TO 600

，就隻會到每個worker上取前600的記錄。但可能會帶來準确性的損失。

另外一個需要注意的是，上面的GROUP BY字段

l_partkey

不是分片字段，如果GROUP BY字段已經包含了分片字段，不需要這個優化，因為這種情況下可以直接把LIMIT下推下去。

另一個測試用例，形式類似

...
SET citus.limit_clause_row_fetch_count TO 150;
SET citus.large_table_shard_count TO 2;
SELECT c_custkey, c_name, count(*) as lineitem_count
    FROM customer, orders, lineitem
    WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
    GROUP BY c_custkey, c_name
    ORDER BY lineitem_count DESC, c_custkey LIMIT 10;
DEBUG:  push down of limit count: 150
 c_custkey |       c_name       | lineitem_count 
-----------+--------------------+----------------
        43 | Customer#000000043 |             42
       370 | Customer#000000370 |             38
        79 | Customer#000000079 |             37
       689 | Customer#000000689 |             36
       472 | Customer#000000472 |             35
       685 | Customer#000000685 |             35
       643 | Customer#000000643 |             34
       226 | Customer#000000226 |             33
       496 | Customer#000000496 |             32
       304 | Customer#000000304 |             31
(10 rows)

小結

适用場景
- citus.limit_clause_row_fetch_count 适用于分組聚合并取TopN結果的SQL的性能優化
不适用場景
- 要求精确結果
- 聚合字段包含分片字段
- count(DISTINCT)

關于citus.limit_clause_row_fetch_count優化參數關于citus.limit_clause_row_fetch_count優化參數

關于 `citus.limit_clause_row_fetch_count` 優化參數

官方文檔說明

測試用例

小結

繼續閱讀

jdk1.7+Eclipse+Maven3.5+Hadoop2.7.3建構hadoop項目

HDFS指令行工具

【51CTO學院三周年】自學路上的伴侶

線上教育巨頭多鄰國Duolingo入華一周年，中國市場馬力全開

【分類算法】什麼是分類算法定義分類與聚類分類過程方法

申請評分模型拒絕推斷（RI）方法申請評分模型拒絕推斷（RI）方法

Sql優化一：sql語句優化

Nacos 2.0 更新前後性能對比壓測

尚矽谷—韓順平—圖解 Java設計模式（結構型）（55～）

Storm編譯打包過程中遇到的一些問題及解決方法

MapReduce的幾個企業級經典面試案例MapReduce的幾個企業級經典面試案例

9.spark Core 進階2--Cashe

淺談企業活動中進行資料分析的重要性

Ambari介紹和架構原理

NOSQL安全攻擊

win10本地scala和spark安裝安裝scala安裝spark

關于citus.limit_clause_row_fetch_count優化參數關于citus.limit_clause_row_fetch_count優化參數

關于 citus.limit_clause_row_fetch_count 優化參數

官方文檔說明

測試用例

小結

繼續閱讀

關于 `citus.limit_clause_row_fetch_count` 優化參數