标簽
PostgreSQL , 表達式 , 自定義函數 , 選擇性 , Statistics , Cardinality , Selectivity , Estimate
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E8%83%8C%E6%99%AF 背景
在資料庫中,統計資訊是估算成本(選擇性)的重要基礎,目前在PG中統計資訊的内容要麼是單列,要麼是自定義的多列統計資訊,要麼是表達式索引的統計資訊(
《PostgreSQL 11 preview - 表達式索引柱狀圖buckets\STATISTICS\default_statistics_target可設定》)。
并不會針對沒有建立索引的表達式建構統計資訊。
那麼當輸入條件的一端是表達式(并且沒有索引)時,如何評估表達式與操作符發生計算時的選擇性呢?
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E9%80%89%E6%8B%A9%E6%80%A7%E8%AE%A1%E7%AE%97%E7%9A%84%E4%BB%A3%E7%A0%81 選擇性計算的代碼
https://www.postgresql.org/docs/devel/static/row-estimation-examples.htmlFor those interested in further details, estimation of the size of a table (before any WHERE clauses) is done in src/backend/optimizer/util/plancat.c. The generic logic for clause selectivities is in src/backend/optimizer/path/clausesel.c. The operator-specific selectivity functions are mostly found in src/backend/utils/adt/selfuncs.c.
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%89%93%E5%BC%80debug%E6%95%88%E6%9E%9C 打開debug效果
src/backend/optimizer/path/clausesel.c
Selectivity
clause_selectivity(PlannerInfo *root,
Node *clause,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
{
......
// 注釋一下,友善輸出選擇性的值
//#ifdef SELECTIVITY_DEBUG
elog(DEBUG4, "clause_selectivity: s1 %f", s1);
//#endif
make
make install
重新開機
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%B5%8B%E8%AF%95 測試
create table a (id int);
insert into a select generate_series(1,10000000);
vacuum analyze a;
建立自定義函數
CREATE OR REPLACE FUNCTION public.f1(integer)
RETURNS integer
LANGUAGE sql
STRICT
AS $function$
select case when $1<1000 then 100000 else 200000 end ;
$function$;
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E6%89%93%E5%BC%80debug%E6%B5%8B%E8%AF%95%E9%80%89%E6%8B%A9%E6%80%A7 打開DEBUG,測試選擇性
1、普通字段的選擇性,算法對應=操作符pg_operator.oprrest字段對應的代碼。
普通字段,選擇性評估精準
postgres=# set client_min_messages ='debug5';
postgres=# explain select * from a where id=1;
DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
DEBUG: clause_selectivity: s1 0.000000
DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
QUERY PLAN
--------------------------------------------------------------------
Gather (cost=1000.00..76498.03 rows=1 width=4) -- rows=1很準确
Workers Planned: 4
-> Parallel Seq Scan on a (cost=0.00..75497.93 rows=1 width=4)
Filter: (id = 1)
(4 rows)
2、改成自定義函數表達式如下,自定義表達式由于沒有統計資訊,是以表達式的估值行數很不準确。
postgres=# explain select * from a where f1(id)=1;
DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
DEBUG: clause_selectivity: s1 0.005000 // 注意
DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
QUERY PLAN
-----------------------------------------------------------
Seq Scan on a (cost=0.00..2669241.96 rows=50000 width=4) -- rows=50000一點不準确
Filter: (f1(id) = 1)
JIT:
Functions: 2
Inlining: true
Optimization: true
(6 rows)
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#gdb%E6%89%BE%E5%88%B0%E4%BC%B0%E5%80%BC%E8%AE%A1%E7%AE%97%E7%94%A8%E5%88%B0%E7%9A%84sel%E5%87%BD%E6%95%B0 gdb找到估值計算用到的sel函數
digoal@iZbp13nu0s9j3x3op4zpd4Z-> psql
psql (11beta1)
Type "help" for help.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
53160
(1 row)
gdb -p 53160
b restriction_selectivity
對應代碼
src/backend/optimizer/path/clausesel.c
Selectivity
clause_selectivity(PlannerInfo *root,
Node *clause,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
{
...........
else if (is_opclause(clause) || IsA(clause, DistinctExpr))
{
OpExpr *opclause = (OpExpr *) clause;
Oid opno = opclause->opno;
if (treat_as_join_clause(clause, rinfo, varRelid, sjinfo))
{
/* Estimate selectivity for a join clause. */
s1 = join_selectivity(root, opno,
opclause->args,
opclause->inputcollid,
jointype,
sjinfo);
}
else
{
/* Estimate selectivity for a restriction clause. */ -- 評估算法如下
s1 = restriction_selectivity(root, opno,
opclause->args,
opclause->inputcollid,
varRelid);
}
/*
* DistinctExpr has the same representation as OpExpr, but the
* contained operator is "=" not "<>", so we must negate the result.
* This estimation method doesn't give the right behavior for nulls,
* but it's better than doing nothing.
*/
if (IsA(clause, DistinctExpr))
s1 = 1.0 - s1;
}
restriction_selectivity對應代碼如下,我們為了檢視選擇性,同樣這裡也加一個ELOG輸出。
src/backend/optimizer/util/plancat.c
/*
* restriction_selectivity
*
* Returns the selectivity of a specified restriction operator clause.
* This code executes registered procedures stored in the
* operator relation, by calling the function manager.
*
* See clause_selectivity() for the meaning of the additional parameters.
*/
Selectivity
restriction_selectivity(PlannerInfo *root,
Oid operatorid,
List *args,
Oid inputcollid,
int varRelid)
{
RegProcedure oprrest = get_oprrest(operatorid);
float8 result;
/*
* if the oprrest procedure is missing for whatever reason, use a
* selectivity of 0.5
*/
if (!oprrest)
return (Selectivity) 0.5;
result = DatumGetFloat8(OidFunctionCall4Coll(oprrest,
inputcollid,
PointerGetDatum(root),
ObjectIdGetDatum(operatorid),
PointerGetDatum(args),
Int32GetDatum(varRelid)));
// 加入如下代碼,列印restriction_selectivity函數計算得到的選擇性
//#ifdef SELECTIVITY_DEBUG
elog(DEBUG4, "restriction_selectivity: result %f", result);
//#endif
if (result < 0.0 || result > 1.0)
elog(ERROR, "invalid restriction selectivity: %f", result);
return (Selectivity) result;
}
make
make install
重新開機

explain時輸出DEBUG資訊如下,可以看到自定義函數的選擇性為0.005,自定義函數值沒有統計資訊柱狀圖,是不準确的根源。
postgres=# set client_min_messages ='debug5';
DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
SET
postgres=# explain select * from a where f1(id)=1;
DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
DEBUG: restriction_selectivity: result 0.005000
DEBUG: clause_selectivity: s1 0.005000
DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
QUERY PLAN
----------------------------------------------------------
Seq Scan on a (cost=0.00..219247.60 rows=50000 width=4)
Filter: (f1(id) = 1)
JIT:
Functions: 2
Inlining: false
Optimization: false
(6 rows)
postgres=# explain select * from a where id=1;
DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
DEBUG: restriction_selectivity: result 0.000000
DEBUG: clause_selectivity: s1 0.000000
DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
QUERY PLAN
--------------------------------------------------------------------
Gather (cost=1000.00..76498.03 rows=1 width=4)
Workers Planned: 4
-> Parallel Seq Scan on a (cost=0.00..75497.93 rows=1 width=4)
Filter: (id = 1)
(4 rows)
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E5%87%BD%E6%95%B0%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%9F%B1%E7%8A%B6%E5%9B%BE%E7%BB%9F%E8%AE%A1%E4%BF%A1%E6%81%AF%E6%94%B6%E9%9B%86 自定義函數(表達式)柱狀圖統計資訊收集
前面的例子,自定義函數值沒有統計資訊柱狀圖,是不準确的根源。
那麼自定義表達式如何收集統計資訊呢?
實際上PG支援表達式索引,索引中包含了表達式的值,以及對應的HEAP TABLE 行号,有了表達式的值,實際上就可以作為統計資訊收集的要素。
如下:
1、建立表達式索引
postgres=# create index idx_a_1 on a(f1(id));
2、收集統計資訊
postgres=# vacuum analyze a;
3、通過索引名稱定位,檢視pg_stats内部是否有表達式的統計資訊了,沒錯,以及有了。
postgres=# select attname from pg_stats where tablename='idx_a_1';
attname
---------
f1
(1 row)
完備的統計資訊格式與内容請參考:
《PostgreSQL 統計資訊pg_statistic格式及導入導出dump_stat - 相容Oracle》4、再次檢視執行計劃,選擇性正确了。
postgres=# explain select * from a where f1(id)=1;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using idx_a_1 on a (cost=0.43..2.65 rows=1 width=4)
Index Cond: (f1(id) = 1)
(2 rows)
postgres=# explain select * from a where id=1;
QUERY PLAN
--------------------------------------------------------------------
Gather (cost=1000.00..76498.03 rows=1 width=4)
Workers Planned: 4
-> Parallel Seq Scan on a (cost=0.00..75497.93 rows=1 width=4)
Filter: (id = 1)
(4 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where f1(id)=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using idx_a_1 on public.a (cost=0.43..2.65 rows=1 width=4) (actual time=0.105..0.105 rows=0 loops=1)
Output: id
Index Cond: (f1(a.id) = 1)
Buffers: shared read=3
Planning Time: 0.112 ms
Execution Time: 0.128 ms
(6 rows)
https://github.com/digoal/blog/blob/master/201806/20180625_02.md#%E5%B0%8F%E7%BB%93 小結
當WHERE 條件中的表達式并非普通列,而是表達式時,在表達式沒有索引的情況下,表達式的選擇性可能是非常不準确的。
為了得到更好的統計資訊,可以建立索引,因為索引内包含了這個表達式的結果值,索引建立完後,就可以收集這個表達式的統計資訊了。有了統計資訊,表達式的評估,選擇性計算就會非常準确。
不管是什麼表達式,評估選擇性都用到了restriction_selectivity,本文通過對PG的選擇性代碼添加elog,開啟DEBUG可以列印輸出當時的選擇性。