天天看點

ClickHouse 使用物化字段投影 PROJECTION 提升性能

ClickHouse 使用物化字段投影 PROJECTION 提升性能

1.1. 使用物化字段投影 PROJECTION 提升性能

ClickHouse作為ROLAP典型代表之一,純列式存儲單表查詢性能幾乎沒有對手。 但是,MergeTree 主鍵隻支援一種排序規則,對查詢性能的提升有局限。而有了Projection,我們就可以針對查詢主題,建立其他排序規則,實作預聚合優化(空間換時間)。

Projection 名字起源于Vertica,相當于傳統意義上的物化視圖。它借鑒 MOLAP 預聚合的思想,在資料寫入的時候,根據 projection 定義的表達式,計算寫入資料的聚合資料同原始資料一并寫入。資料查詢的過程中,如果查詢SQL通過分析可以通過聚合資料得出,直接查詢聚合資料減少計算的開銷,解決了由于資料量導緻的記憶體問題。Projeciton 底層存儲上屬于part目錄下資料的擴充,可以了解為查詢索引的一種形式。

1.1.1. 新增高基次元投影

新增高基次元投影

我們建立的clickhouse_tutorial.user_tag表的聯合索引排序鍵是order by (WatchID, UserID, EventTime),我們可以再選擇UserID字段作為高基次元投影字段,執行如下 SQL:

ALTER TABLE clickhouse_tutorial.user_tag ADD PROJECTION pUserID(SELECT * ORDER BY UserID);

統計字段的基數值

SELECT

    uniqExact(UserID),

    uniqExact(RegionID),

    uniqExact(OS)

FROM clickhouse_tutorial.user_tag

Query id: 35fcdb0b-f530-4f8c-8aa5-6faf71276e31

0 rows in set. Elapsed: 0.002 sec.

Received exception from server (version 21.12.1):

Code: 584. DB::Exception: Received from localhost:9000. DB::Exception: No projection is used when allow_experimental_projection_optimization = 1 and force_optimize_projection = 1. (PROJECTION_NOT_USED)

提示PROJECTION_NOT_USED,這個時候需要把強制使用 projection 優化設定關掉:set force_optimize_projection = 0。

重新執行上面的 SQL,輸出:

SELECT

    uniqExact(UserID),

    uniqExact(RegionID),

    uniqExact(OS)

FROM clickhouse_tutorial.user_tag

Query id: c3beca8c-b261-4e5f-bb6d-b78bbc233ef2

┌─uniqExact(UserID)─┬─uniqExact(RegionID)─┬─uniqExact(OS)─┐

│            119689 │                4727 │            68 │

└───────────────────┴─────────────────────┴───────────────┘

1 rows in set. Elapsed: 3.022 sec. Processed 1.70 billion rows, 22.05 GB (561.17 million rows/s., 7.30 GB/s.)

這樣我們就有了低基投影pOS、中基投影pRegionID、高基投影pUserID。

1.1.2. 建構測試資料

為了友善測試性能,我們構造 10 億行級别的測試資料。多次執行如下INSERT 語句:

INSERT INTO clickhouse_tutorial.user_tag

(UserID,

 WatchID,

 EventTime,

 Sex,

 Age,

 OS,

 RegionID,

 RequestNum,

 EventDate)

select UserID,

       WatchID,

       EventTime,

       Sex,

       Age,

       OS,

       RegionID,

       RequestNum,

       EventDate

from clickhouse_tutorial.user_tag;

資料行:

SELECT count()

FROM clickhouse_tutorial.user_tag

Query id: 2bb042e1-9559-4ee4-979d-cc7466a23715

┌────count()─┐

│ 1696002516 │

└────────────┘

1 rows in set. Elapsed: 0.001 sec.

1.1.3. 關閉 Projection 優化開關測試

關閉投影優化:

SET allow_experimental_projection_optimization=0;

執行查詢

select * from clickhouse_tutorial.user_tag where UserID = 1389883949241360436;

59888 rows in set. Elapsed: 0.669 sec. Processed 1.70 billion rows, 13.67 GB (2.54 billion rows/s., 20.44 GB/s.)

檢視執行計劃:

EXPLAIN actions = 1

SELECT *

FROM clickhouse_tutorial.user_tag

WHERE UserID = 1389883949241360436

Query id: b27ea9ba-4f0b-4db3-b315-f1e827529445

┌─explain───────────────────────────────────────────────────────────────────┐

│ Expression ((Projection + Before ORDER BY))                               │

│ Actions: INPUT :: 0 -> UserID UInt64 : 0                                  │

│          INPUT :: 1 -> WatchID UInt64 : 1                                 │

│          INPUT :: 2 -> EventTime DateTime : 2                             │

│          INPUT :: 3 -> Sex UInt8 : 3                                      │

│          INPUT :: 4 -> Age UInt8 : 4                                      │

│          INPUT :: 5 -> OS UInt8 : 5                                       │

│          INPUT :: 6 -> RegionID UInt32 : 6                                │

│          INPUT :: 7 -> RequestNum UInt32 : 7                              │

│          INPUT :: 8 -> EventDate Date : 8                                 │

│ Positions: 0 1 2 3 4 5 6 7 8                                              │

│   SettingQuotaAndLimits (Set limits and quota after reading from storage) │

│     ReadFromMergeTree                                                     │

│     ReadType: Default                                                     │

│     Parts: 47                                                             │

│     Granules: 207045                                                      │

└───────────────────────────────────────────────────────────────────────────┘

16 rows in set. Elapsed: 0.028 sec.

1.1.4. 開啟 Projection 優化開關測試

開啟投影優化:

SET allow_experimental_projection_optimization=1;

執行查詢:

select * from clickhouse_tutorial.user_tag where UserID = 1389883949241360436;

59888 rows in set. Elapsed: 0.119 sec. Processed 442.37 thousand rows, 12.23 MB (3.72 million rows/s., 102.86 MB/s.)

檢視執行計劃:

EXPLAIN actions = 1

SELECT *

FROM clickhouse_tutorial.user_tag

WHERE UserID = 1389883949241360436

Query id: f2db2a78-d275-4d55-ae10-5fe1cdf43f8a

┌─explain───────────────────────────────────────────────────────────────────┐

│ Expression ((Projection + Before ORDER BY))                               │

│ Actions: INPUT :: 0 -> UserID UInt64 : 0                                  │

│          INPUT :: 1 -> WatchID UInt64 : 1                                 │

│          INPUT :: 2 -> EventTime DateTime : 2                             │

│          INPUT :: 3 -> Sex UInt8 : 3                                      │

│          INPUT :: 4 -> Age UInt8 : 4                                      │

│          INPUT :: 5 -> OS UInt8 : 5                                       │

│          INPUT :: 6 -> RegionID UInt32 : 6                                │

│          INPUT :: 7 -> RequestNum UInt32 : 7                              │

│          INPUT :: 8 -> EventDate Date : 8                                 │

│ Positions: 0 1 2 3 4 5 6 7 8                                              │

│   SettingQuotaAndLimits (Set limits and quota after reading from storage) │

│     ReadFromStorage (MergeTree(with Normal projection pUserID))           │

└───────────────────────────────────────────────────────────────────────────┘

13 rows in set. Elapsed: 0.039 sec.

1.1.5. 性能資料

pUserID的性能對比

為了直覺友善地看到性能差距,特意整理pUserID的性能對比表如下。

SELECT *

FROM clickhouse_tutorial.user_tag

WHERE UserID = 1389883949241360436

pUserID(高基數)
對比項 關閉 Projection 優化 開啟 Projection 優化 倍數
掃描資料行 1700000000 442370 3843
處理資料大小(MB) 13998.08 12.23 1145
響應時間(秒) 0.669 0.119 6

pRegionID的性能對比

pRegionID的性能對比表如下。

SELECT

    RegionID,

    count(1)

FROM clickhouse_tutorial.user_tag

GROUP BY RegionID

開啟Projection優化:

4727 rows in set. Elapsed: 0.020 sec. Processed 98.99 thousand rows, 2.42 MB (4.97 million rows/s., 121.37 MB/s.)

不開啟Projection優化:

4727 rows in set. Elapsed: 1.005 sec. Processed 1.70 billion rows, 6.78 GB (1.69 billion rows/s., 6.75 GB/s.)

pRegionID(中基數)
對比項 關閉 Projection 優化 開啟 Projection 優化 倍數
掃描資料行 1700000000 98990 17173
處理資料大小(MB) 6942.72 2.42 2869
響應時間(秒) 1.005 0.02 50

pOS的性能對比

SELECT

    OS,

    count(1)

FROM clickhouse_tutorial.user_tag

GROUP BY OS

開啟Projection優化:

68 rows in set. Elapsed: 0.008 sec. Processed 2.43 thousand rows, 210.31 KB (309.74 thousand rows/s., 26.77 MB/s.)

不開啟Projection優化:

68 rows in set. Elapsed: 0.378 sec. Processed 1.70 billion rows, 1.70 GB (4.49 billion rows/s., 4.49 GB/s.)

pOS(低基數)
對比項 關閉 Projection 優化 開啟 Projection 優化 倍數
掃描資料行 1700000000 2430 699588
處理資料大小(MB) 1740.8 0.20538 8476
響應時間(秒) 0.378 0.005 76 

1.1.6. 次元字段基數對 Projection 性能的影響

可以看到由于UserID、RegionID、OS這 3 個字段的基數值的不同,對 Projection 查詢性能帶來的影響。如下表所示。

基數值 68 4727 119689
掃描資料行倍數 699588 17173 3843
處理資料大小倍數 8476 2869 1145
響應時間倍數 76 50 6

可以看出,基數越大,Projection 性能提升越小;基數越小,Projection 性能提升越明顯。從性能測試的資料上看,Projeciton對查詢性能有着百倍級别的提升。