HybridDB for PostgreSQL , Greenplum 寫入性能優化實踐

标簽

PostgreSQL , Greenplum , HybridDB for PostgreSQL , insert , 性能

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E8%83%8C%E6%99%AF 背景

Greenplum寫入資料的性能優化實踐。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#1-%E9%93%BE%E8%B7%AF 1 鍊路

盡量縮短用戶端和資料庫的鍊路，不要太多的跳數，比如NAT，PROXY，等越多，性能越差。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#2-%E8%BF%9E%E6%8E%A5%E6%96%B9%E5%BC%8F 2 連接配接方式

盡量使用長連接配接，不要使用短連接配接，短連接配接的打開開銷非常大。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#3-%E5%AD%98%E5%82%A8%E5%B1%9E%E6%80%A7 3 存儲屬性

where storage_parameter is:  
   APPENDONLY={TRUE|FALSE}  
   BLOCKSIZE={8192-2097152}  
   ORIENTATION={COLUMN|ROW}  
   COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}  
   COMPRESSLEVEL={0-9}  
   CHECKSUM={TRUE|FALSE}  
   FILLFACTOR={10-100}  
   OIDS[=TRUE|FALSE]

行存與列存的選擇方面，需要權衡插入與查詢的風格與需求的性能，同時它們支援互相轉換。

列存在單條INSERT時，性能較差，原因和列存的可靠性機制有關。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#4-%E6%8F%92%E5%85%A5%E6%96%B9%E6%B3%95 4 插入方法

1、單條INSERT，單步送出，性能最差

2、單條INSERT，批量送出

3、批量INSERT

4、COPY

5、segment并行，性能最好

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#5-%E7%B4%A2%E5%BC%95 5 索引

索引越多，寫入性能越差。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#6-%E7%BB%91%E5%AE%9A%E5%8F%98%E9%87%8F 6 綁定變量

如果是使用INSERT的單條寫入方法，可以考慮使用綁定變量。減少CPU硬解析。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#7-%E5%B9%B6%E5%8F%91%E6%95%B0 7 并發數

INSERT的寫入，也可以使用并行（開多個連接配接）的方式，但是并不是開越多越好，通常不建議超過CPU核數。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#8-%E5%80%BE%E6%96%9C 8 傾斜

傾斜，資料分布一定不要出現傾斜，否則就會導緻木桶效應，導緻某些SEGMENT繁忙，某些空閑，不利于整體性能。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E4%BE%8B%E5%AD%90 例子

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#1-%E6%80%A7%E8%83%BD%E8%BE%83%E5%B7%AE%E7%9A%84%E4%BE%8B%E5%AD%90 1 性能較差的例子

使用列存，壓縮，大BLOCK，一條一條INSERT，單步送出，使用短連接配接。

每秒插入19行。

create table t_bad(  
id int,   
c1 int default (random()*100)::int,   
c2 int default (random()*100)::int,   
c3 int default (random()*100)::int,   
c4 int default (random()*100)::int,   
c5 int default (random()*100)::int,   
c6 int default (random()*100)::int,   
c7 int default (random()*100)::int,   
c8 int default (random()*100)::int,   
c9 int default (random()*100)::int,   
c10 int default (random()*100)::int,   
c11 int default (random()*100)::int,   
c12 int default (random()*100)::int,   
c13 int default (random()*100)::int,   
c14 int default (random()*100)::int,   
c15 int default (random()*100)::int,   
c16 int default (random()*100)::int,   
c17 int default (random()*100)::int,   
c18 int default (random()*100)::int,   
c19 int default (random()*100)::int,   
c20 int default (random()*100)::int,   
c21 int default (random()*100)::int,   
c22 int default (random()*100)::int,   
c23 int default (random()*100)::int,   
c24 int default (random()*100)::int,   
c25 int default (random()*100)::int,   
c26 int default (random()*100)::int,   
c27 int default (random()*100)::int,   
c28 int default (random()*100)::int,   
c29 int default (random()*100)::int,   
c30 int default (random()*100)::int,   
c31 int default (random()*100)::int,   
c32 int default (random()*100)::int,   
crt_time timestamp  
)   
with (APPENDONLY=true, BLOCKSIZE=2097152, ORIENTATION=column, COMPRESSTYPE=zlib, CHECKSUM=true);

vi test.sql  
\set id random(1,10000000)  
insert into t_bad (id, crt_time) values (:id, now());

/home/digoal/pgsql10.4/bin/pgbench -M simple -n -r -P 1 -f ./test.sql -C -c 1 -j 1 -T 120  
  
transaction type: ./test.sql  
scaling factor: 1  
query mode: simple  
number of clients: 1  
number of threads: 1  
duration: 120 s  
number of transactions actually processed: 2190  
latency average = 50.341 ms  
latency stddev = 1.752 ms  
tps = 18.243126 (including connections establishing)  
tps = 19.855318 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.003  \set id random(1,10000000)  
        50.338  insert into t_bad (id, crt_time) values (:id, now());

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#2-%E6%80%A7%E8%83%BD%E8%BE%83%E5%A5%BD%E7%9A%84%E4%BE%8B%E5%AD%90 2 性能較好的例子

使用行存，批量INSERT，使用長連接配接，使用并發INSERT。

每秒插入3.41萬行。

create table t_good(  
id int,   
c1 int default (random()*100)::int,   
c2 int default (random()*100)::int,   
c3 int default (random()*100)::int,   
c4 int default (random()*100)::int,   
c5 int default (random()*100)::int,   
c6 int default (random()*100)::int,   
c7 int default (random()*100)::int,   
c8 int default (random()*100)::int,   
c9 int default (random()*100)::int,   
c10 int default (random()*100)::int,   
c11 int default (random()*100)::int,   
c12 int default (random()*100)::int,   
c13 int default (random()*100)::int,   
c14 int default (random()*100)::int,   
c15 int default (random()*100)::int,   
c16 int default (random()*100)::int,   
c17 int default (random()*100)::int,   
c18 int default (random()*100)::int,   
c19 int default (random()*100)::int,   
c20 int default (random()*100)::int,   
c21 int default (random()*100)::int,   
c22 int default (random()*100)::int,   
c23 int default (random()*100)::int,   
c24 int default (random()*100)::int,   
c25 int default (random()*100)::int,   
c26 int default (random()*100)::int,   
c27 int default (random()*100)::int,   
c28 int default (random()*100)::int,   
c29 int default (random()*100)::int,   
c30 int default (random()*100)::int,   
c31 int default (random()*100)::int,   
c32 int default (random()*100)::int,   
crt_time timestamp  
) ;

vi test.sql  
  
\set id random(1,10000000)  
insert into t_good (id, crt_time) select random()*100000000,now() from generate_series(1,100) t(id);

/home/digoal/pgsql10.4/bin/pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120  
  
transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 32  
number of threads: 32  
duration: 120 s  
number of transactions actually processed: 41006  
latency average = 93.658 ms  
latency stddev = 22.133 ms  
tps = 341.492522 (including connections establishing)  
tps = 341.562788 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.005  \set id random(1,10000000)  
        93.713  insert into t_good (id, crt_time) select random()*100000000,now() from generate_series(1,100) t(id);

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E5%B0%8F%E7%BB%93 小結

1、縮短鍊路，不要太多的跳數，比如NAT，PROXY，等越多，性能越差。

2、使用長連接配接。不要使用短連接配接。

3、盡量使用COPY或批量INSERT，不要單條INSERT。

4、行存與列存的選擇方面，需要權衡插入與查詢的風格與需求的性能，同時它們支援互相轉換。

5、一次性或間歇性導入，建議使用并行導入方式。通過segment并行導入的方式。

6、INSERT的寫入，也可以使用并行（開多個連接配接）的方式，但是并不是開越多越好，通常不建議超過CPU核數。

7、索引越多，寫入越慢。如果是批量一次性導入，可以先導入後并行建索引（同時建立多個索引）。

8、綁定變量，如果是使用INSERT的單條寫入方法，可以考慮使用綁定變量。減少CPU硬解析。

9、傾斜，資料分布一定不要出現傾斜，否則就會導緻木桶效應，導緻某些SEGMENT繁忙，某些空閑，不利于整體性能。

10、其他方面，比如資源隊列限制，硬體能力（CPU，IO，網絡等）都可能成為瓶頸，建議遇到性能問題時觀察。

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E5%8F%82%E8%80%83 參考

《分布式DB(Greenplum)中資料傾斜的原因和解法 - 阿裡雲HybridDB for PostgreSQL最佳實踐》《Greenplum insert的性能(單步\批量\copy) - 暨推薦使用gpfdist、阿裡雲oss外部表并行導入》《Greenplum & PostgreSQL UPSERT udf 實作 - 2 batch批量模式》《Greenplum & PostgreSQL UPSERT udf 實作 - 1 單行模式》《Greenplum 海量資料，大寬表行存 VS 列存》《Greenplum 行存、列存，堆表、AO表性能對比 - 阿裡雲HDB for PostgreSQL最佳實踐》《Greenplum 行存、列存，堆表、AO表的原理和選擇》《Greenplum 最佳實踐 - 行存與列存的選擇以及轉換方法》

HybridDB for PostgreSQL , Greenplum 寫入性能優化實踐

标簽

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E8%83%8C%E6%99%AF 背景

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#1-%E9%93%BE%E8%B7%AF 1 鍊路

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#2-%E8%BF%9E%E6%8E%A5%E6%96%B9%E5%BC%8F 2 連接配接方式

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#3-%E5%AD%98%E5%82%A8%E5%B1%9E%E6%80%A7 3 存儲屬性

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#4-%E6%8F%92%E5%85%A5%E6%96%B9%E6%B3%95 4 插入方法

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#5-%E7%B4%A2%E5%BC%95 5 索引

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#6-%E7%BB%91%E5%AE%9A%E5%8F%98%E9%87%8F 6 綁定變量

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#7-%E5%B9%B6%E5%8F%91%E6%95%B0 7 并發數

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#8-%E5%80%BE%E6%96%9C 8 傾斜

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E4%BE%8B%E5%AD%90 例子

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#1-%E6%80%A7%E8%83%BD%E8%BE%83%E5%B7%AE%E7%9A%84%E4%BE%8B%E5%AD%90 1 性能較差的例子

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#2-%E6%80%A7%E8%83%BD%E8%BE%83%E5%A5%BD%E7%9A%84%E4%BE%8B%E5%AD%90 2 性能較好的例子

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E5%B0%8F%E7%BB%93 小結

https://github.com/digoal/blog/blob/master/201807/20180702_01.md#%E5%8F%82%E8%80%83 參考

繼續閱讀

艱難安裝LDAP,SSL認證

《Linux指令行與Shell腳本程式設計大全第2版.布盧姆》pdf

MySQL的4種隔離級别？出現問題

XX系統實施過程問題總結

無元件上傳圖檔到資料庫中，最完整解決方案

【MySQL資料庫】資料庫索引事務1.索引2.事務

結構體：typedef與struct的差別

neo4j之cypher使用文檔

C經典書籍筆記——C陷阱與缺陷②(文法陷阱之優先級)一、錯誤案列二、優先級規律

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

線性表之順序表的實作

登入plsql 報錯 the account is locked --使用者被鎖

C++判斷素數、求最大公約數代碼判斷一個數是否為素數求兩個數的最大公約數

SequoiaDB巨杉資料庫C++驅動概述

hdu7108哈希