PostgreSQL 自定義函數表達式選擇性評估算法 - Statistics, Cardinality, Selectivity, Estimate

标簽

PostgreSQL , 表達式 , 自定義函數 , 選擇性 , Statistics , Cardinality , Selectivity , Estimate

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E8%83%8C%E6%99%AF 背景

在資料庫中，統計資訊是估算成本（選擇性）的重要基礎，目前在PG中統計資訊的内容要麼是單列，要麼是自定義的多列統計資訊，要麼是表達式索引的統計資訊(

《PostgreSQL 11 preview - 表達式索引柱狀圖buckets\STATISTICS\default_statistics_target可設定》

)。

并不會針對沒有建立索引的表達式建構統計資訊。

那麼當輸入條件的一端是表達式（并且沒有索引）時，如何評估表達式與操作符發生計算時的選擇性呢？

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E9%80%89%E6%8B%A9%E6%80%A7%E8%AE%A1%E7%AE%97%E7%9A%84%E4%BB%A3%E7%A0%81 選擇性計算的代碼

https://www.postgresql.org/docs/devel/static/row-estimation-examples.html

For those interested in further details, estimation of the size of a table (before any WHERE clauses) is done in src/backend/optimizer/util/plancat.c. The generic logic for clause selectivities is in src/backend/optimizer/path/clausesel.c. The operator-specific selectivity functions are mostly found in src/backend/utils/adt/selfuncs.c.

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%89%93%E5%BC%80debug%E6%95%88%E6%9E%9C 打開debug效果

src/backend/optimizer/path/clausesel.c

Selectivity  
clause_selectivity(PlannerInfo *root,  
                                   Node *clause,  
                                   int varRelid,  
                                   JoinType jointype,  
                                   SpecialJoinInfo *sjinfo)  
{  
  
  
......  
  
  
// 注釋一下，友善輸出選擇性的值  
  
  
//#ifdef SELECTIVITY_DEBUG  
        elog(DEBUG4, "clause_selectivity: s1 %f", s1);  
//#endif

make  
make install  
  
  
重新開機

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%B5%8B%E8%AF%95 測試

create table a (id int);  
insert into a select generate_series(1,10000000);  
vacuum analyze a;

建立自定義函數

CREATE OR REPLACE FUNCTION public.f1(integer)  
 RETURNS integer  
 LANGUAGE sql  
 STRICT  
AS $function$  
  select case when $1<1000 then 100000 else 200000 end ;  
$function$;

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%89%93%E5%BC%80debug%E6%B5%8B%E8%AF%95%E9%80%89%E6%8B%A9%E6%80%A7 打開DEBUG，測試選擇性

1、普通字段的選擇性，算法對應=操作符pg_operator.oprrest字段對應的代碼。

普通字段，選擇性評估精準

postgres=# set client_min_messages ='debug5';  
  
postgres=# explain select * from a where id=1;  
DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0  
DEBUG:  clause_selectivity: s1 0.000000  
DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0  
                             QUERY PLAN                               
--------------------------------------------------------------------  
 Gather  (cost=1000.00..76498.03 rows=1 width=4)   -- rows=1很準确  
   Workers Planned: 4  
   ->  Parallel Seq Scan on a  (cost=0.00..75497.93 rows=1 width=4)  
         Filter: (id = 1)  
(4 rows)

2、改成自定義函數表達式如下，自定義表達式由于沒有統計資訊，是以表達式的估值行數很不準确。

postgres=# explain select * from a where f1(id)=1;  
DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0  
DEBUG:  clause_selectivity: s1 0.005000    // 注意  
DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0  
                        QUERY PLAN                           
-----------------------------------------------------------  
 Seq Scan on a  (cost=0.00..2669241.96 rows=50000 width=4)   -- rows=50000一點不準确  
   Filter: (f1(id) = 1)  
 JIT:  
   Functions: 2  
   Inlining: true  
   Optimization: true  
(6 rows)

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#gdb%E6%89%BE%E5%88%B0%E4%BC%B0%E5%80%BC%E8%AE%A1%E7%AE%97%E7%94%A8%E5%88%B0%E7%9A%84sel%E5%87%BD%E6%95%B0 gdb找到估值計算用到的sel函數

digoal@iZbp13nu0s9j3x3op4zpd4Z-> psql  
psql (11beta1)  
Type "help" for help.  
  
postgres=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
          53160  
(1 row)

gdb -p 53160  
  
b restriction_selectivity

對應代碼

src/backend/optimizer/path/clausesel.c  
  
  
Selectivity  
clause_selectivity(PlannerInfo *root,  
                                   Node *clause,  
                                   int varRelid,  
                                   JoinType jointype,  
                                   SpecialJoinInfo *sjinfo)  
{  
  
...........  
        else if (is_opclause(clause) || IsA(clause, DistinctExpr))  
        {  
                OpExpr     *opclause = (OpExpr *) clause;  
                Oid                     opno = opclause->opno;  
  
                if (treat_as_join_clause(clause, rinfo, varRelid, sjinfo))  
                {  
                        /* Estimate selectivity for a join clause. */  
                        s1 = join_selectivity(root, opno,  
                                                                  opclause->args,  
                                                                  opclause->inputcollid,  
                                                                  jointype,  
                                                                  sjinfo);  
                }  
                else  
                {  
                        /* Estimate selectivity for a restriction clause. */ -- 評估算法如下  
                        s1 = restriction_selectivity(root, opno,  
                                                                                 opclause->args,  
                                                                                 opclause->inputcollid,  
                                                                                 varRelid);  
                }  
  
                /*  
                 * DistinctExpr has the same representation as OpExpr, but the  
                 * contained operator is "=" not "<>", so we must negate the result.  
                 * This estimation method doesn't give the right behavior for nulls,  
                 * but it's better than doing nothing.  
                 */  
                if (IsA(clause, DistinctExpr))  
                        s1 = 1.0 - s1;  
        }

restriction_selectivity對應代碼如下，我們為了檢視選擇性，同樣這裡也加一個ELOG輸出。

src/backend/optimizer/util/plancat.c

/*  
 * restriction_selectivity  
 *  
 * Returns the selectivity of a specified restriction operator clause.  
 * This code executes registered procedures stored in the  
 * operator relation, by calling the function manager.  
 *  
 * See clause_selectivity() for the meaning of the additional parameters.  
 */  
Selectivity  
restriction_selectivity(PlannerInfo *root,  
                                                Oid operatorid,  
                                                List *args,  
                                                Oid inputcollid,  
                                                int varRelid)  
{  
        RegProcedure oprrest = get_oprrest(operatorid);  
        float8          result;  
  
        /*  
         * if the oprrest procedure is missing for whatever reason, use a  
         * selectivity of 0.5  
         */  
        if (!oprrest)  
                return (Selectivity) 0.5;  
  
        result = DatumGetFloat8(OidFunctionCall4Coll(oprrest,  
                                                                                                 inputcollid,  
                                                                                                 PointerGetDatum(root),  
                                                                                                 ObjectIdGetDatum(operatorid),  
                                                                                                 PointerGetDatum(args),  
                                                                                                 Int32GetDatum(varRelid)));  
// 加入如下代碼，列印restriction_selectivity函數計算得到的選擇性  
//#ifdef SELECTIVITY_DEBUG  
        elog(DEBUG4, "restriction_selectivity: result %f", result);  
//#endif   
  
        if (result < 0.0 || result > 1.0)  
                elog(ERROR, "invalid restriction selectivity: %f", result);  
  
        return (Selectivity) result;  
}

make  
make install  
  
  
重新開機

PostgreSQL 自定義函數表達式選擇性評估算法 - Statistics, Cardinality, Selectivity, Estimate

https://www.postgresql.org/docs/devel/static/catalog-pg-operator.html

explain時輸出DEBUG資訊如下，可以看到自定義函數的選擇性為0.005，自定義函數值沒有統計資訊柱狀圖，是不準确的根源。

postgres=# set client_min_messages ='debug5';  
DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0  
SET  
postgres=# explain select * from a where f1(id)=1;  
DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0  
DEBUG:  restriction_selectivity: result 0.005000  
DEBUG:  clause_selectivity: s1 0.005000  
DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0  
                        QUERY PLAN                          
----------------------------------------------------------  
 Seq Scan on a  (cost=0.00..219247.60 rows=50000 width=4)  
   Filter: (f1(id) = 1)  
 JIT:  
   Functions: 2  
   Inlining: false  
   Optimization: false  
(6 rows)  
  
postgres=# explain select * from a where id=1;  
DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0  
DEBUG:  restriction_selectivity: result 0.000000  
DEBUG:  clause_selectivity: s1 0.000000  
DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0  
                             QUERY PLAN                               
--------------------------------------------------------------------  
 Gather  (cost=1000.00..76498.03 rows=1 width=4)  
   Workers Planned: 4  
   ->  Parallel Seq Scan on a  (cost=0.00..75497.93 rows=1 width=4)  
         Filter: (id = 1)  
(4 rows)

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E5%87%BD%E6%95%B0%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%9F%B1%E7%8A%B6%E5%9B%BE%E7%BB%9F%E8%AE%A1%E4%BF%A1%E6%81%AF%E6%94%B6%E9%9B%86 自定義函數（表達式）柱狀圖統計資訊收集

前面的例子，自定義函數值沒有統計資訊柱狀圖，是不準确的根源。

那麼自定義表達式如何收集統計資訊呢？

實際上PG支援表達式索引，索引中包含了表達式的值，以及對應的HEAP TABLE 行号，有了表達式的值，實際上就可以作為統計資訊收集的要素。

如下：

1、建立表達式索引

postgres=# create index idx_a_1 on a(f1(id));

2、收集統計資訊

postgres=# vacuum analyze a;

3、通過索引名稱定位，檢視pg_stats内部是否有表達式的統計資訊了，沒錯，以及有了。

postgres=# select attname from pg_stats where tablename='idx_a_1';  
  
 attname   
---------  
 f1  
(1 row)

完備的統計資訊格式與内容請參考：

《PostgreSQL 統計資訊pg_statistic格式及導入導出dump_stat - 相容Oracle》

4、再次檢視執行計劃，選擇性正确了。

postgres=# explain select * from a where f1(id)=1;  
                           QUERY PLAN                              
-----------------------------------------------------------------  
 Index Scan using idx_a_1 on a  (cost=0.43..2.65 rows=1 width=4)  
   Index Cond: (f1(id) = 1)  
(2 rows)  
  
postgres=# explain select * from a where id=1;  
                             QUERY PLAN                               
--------------------------------------------------------------------  
 Gather  (cost=1000.00..76498.03 rows=1 width=4)  
   Workers Planned: 4  
   ->  Parallel Seq Scan on a  (cost=0.00..75497.93 rows=1 width=4)  
         Filter: (id = 1)  
(4 rows)  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where f1(id)=1;  
                                                    QUERY PLAN                                                      
------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_a_1 on public.a  (cost=0.43..2.65 rows=1 width=4) (actual time=0.105..0.105 rows=0 loops=1)  
   Output: id  
   Index Cond: (f1(a.id) = 1)  
   Buffers: shared read=3  
 Planning Time: 0.112 ms  
 Execution Time: 0.128 ms  
(6 rows)

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E5%B0%8F%E7%BB%93 小結

當WHERE 條件中的表達式并非普通列，而是表達式時，在表達式沒有索引的情況下，表達式的選擇性可能是非常不準确的。

為了得到更好的統計資訊，可以建立索引，因為索引内包含了這個表達式的結果值，索引建立完後，就可以收集這個表達式的統計資訊了。有了統計資訊，表達式的評估，選擇性計算就會非常準确。

不管是什麼表達式，評估選擇性都用到了restriction_selectivity，本文通過對PG的選擇性代碼添加elog，開啟DEBUG可以列印輸出當時的選擇性。

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E5%8F%82%E8%80%83 參考

《PostgreSQL 多值列的選擇性 - Statistics, Cardinality, Selectivity, Estimate》

PostgreSQL 自定義函數表達式選擇性評估算法 - Statistics, Cardinality, Selectivity, Estimate

标簽

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E8%83%8C%E6%99%AF 背景

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E9%80%89%E6%8B%A9%E6%80%A7%E8%AE%A1%E7%AE%97%E7%9A%84%E4%BB%A3%E7%A0%81 選擇性計算的代碼

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%89%93%E5%BC%80debug%E6%95%88%E6%9E%9C 打開debug效果

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%B5%8B%E8%AF%95 測試

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%89%93%E5%BC%80debug%E6%B5%8B%E8%AF%95%E9%80%89%E6%8B%A9%E6%80%A7 打開DEBUG，測試選擇性

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#gdb%E6%89%BE%E5%88%B0%E4%BC%B0%E5%80%BC%E8%AE%A1%E7%AE%97%E7%94%A8%E5%88%B0%E7%9A%84sel%E5%87%BD%E6%95%B0 gdb找到估值計算用到的sel函數

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E5%87%BD%E6%95%B0%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%9F%B1%E7%8A%B6%E5%9B%BE%E7%BB%9F%E8%AE%A1%E4%BF%A1%E6%81%AF%E6%94%B6%E9%9B%86 自定義函數（表達式）柱狀圖統計資訊收集

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E5%B0%8F%E7%BB%93 小結

https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E5%8F%82%E8%80%83 參考

繼續閱讀

Codeforces 1417 D. Make Them Equal(思維+構造)

查找算法之二分查找查找算法之二分查找

查找算法學習之二分查找（Python版本）——BinarySearch

CQ V1.0分詞bates(基于雙數組tire樹)—應該是目前最快的中文分詞算法

Command Network(POJ 3164)---定根最小樹形圖模闆題題目描述輸入格式輸出格式輸入樣例輸出樣例分析源程式

開源低帶寬語音編解碼器

241 Different Ways to Add Parentheses（C代碼版）

【趨高機器視覺】機器視覺技術原了解析及解決方案

CSMA/CD1． CSMA/CD的概述2． CSMA 的工作原理3． CSMA/CD控制規程及特點4． CSMA/CD協定5． CSMA/CD的優點6．結束語

極大似然法(ML)與最大期望法(EM)

C++ 第十五周報告1--《冒泡法排序》

筆試面試題目：滑動視窗(二)

資料結構與算法（27）——排序（二）

Dijkstra--簡易版（最短路徑）

GitHub連夜封殺！這份阿裡 10W 字内部 Java 字面試手冊到底有多強？

hdu7108哈希