
[Study Material] Issue 8: PostgreSQL Capacity Planning - Micro, Macro, and Precise Multi-Perspective Estimation of Database Performance (Hardware Selection and Budgeting Without Outside Help)

Background

An indispensable step when requesting a budget is estimating how much hardware is needed.

Typically the business side is asked to provide some figures, such as user count, PV, UV, and so on. But an estimate based on these relies purely on experience; the method is very rough and inaccurate.

So how exactly do you estimate how much hardware, or what specification of hardware, is needed to support your future business?

For the PostgreSQL database, I will introduce the following three estimation methods:

1. Micro estimation (relatively accurate)

2. Macro estimation (helpful for product selection, less helpful for sizing; somewhat rough)

3. Precise measurement (the most accurate, but requires deep familiarity with the business and an accurate grasp of future bottlenecks)

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%B8%80%E5%BE%AE%E8%A7%82%E4%BC%B0%E7%AE%97%E6%B3%95 1. The Micro Estimation Method

When we interact with the database through SQL, how does the database execute it?

First the SQL is parsed, then candidate execution paths are generated, the optimal path is chosen, and the SQL is executed. The most critical step is choosing the optimal execution path. PostgreSQL has a cost-based optimizer (CBO): it chooses according to cost.

Cost was just mentioned; how is it calculated? The cost is computed by combining the scan method with statistics to estimate how many data blocks and how many records must be scanned, and then applying the cost-estimation formula for the corresponding scan method.
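As a minimal illustration (the table t and all numbers below are hypothetical, and the plan output is representative rather than exact), EXPLAIN prints the planner's cost estimate without executing the query:

```sql
-- Hypothetical table, for illustration only
CREATE TABLE t (id int, info text);
INSERT INTO t SELECT generate_series(1, 100000), 'x';
ANALYZE t;   -- collect the statistics the planner relies on

-- EXPLAIN shows the estimated cost, not the actual execution time
EXPLAIN SELECT * FROM t WHERE id < 100;
--                      QUERY PLAN
-- ----------------------------------------------------
--  Seq Scan on t  (cost=0.00..1693.00 rows=99 width=6)
--    Filter: (id < 100)
```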

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%B8%80%E4%B8%AA-query-%E6%9C%89%E5%93%AA%E4%BA%9B%E6%88%90%E6%9C%AC What costs a QUERY has

1. Costs include:

I/O cost and CPU cost.

2. I/O costs include:

Sequential I/O cost and random I/O cost.

3. CPU costs include:

The cost of fetching tuples or items from indexes, TOAST indexes, heap tables, and TOAST tables;

The cost of operators and functions processing rows;

The cost of processing JOINs, and so on. A worked example of these components follows.
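To make the I/O and CPU components concrete, here is a sketch of the sequential-scan estimate described in the PostgreSQL planner documentation, computed by hand from pg_class statistics; the table t is the same hypothetical example as above, and the exact relpages value depends on actual storage:

```sql
-- Rough sequential-scan cost:
--   relpages  * seq_page_cost       (sequential I/O component)
-- + reltuples * cpu_tuple_cost      (CPU component)
-- + reltuples * cpu_operator_cost per WHERE-clause operator
SELECT relpages,
       reltuples,
       relpages::numeric  * current_setting('seq_page_cost')::numeric
     + reltuples::numeric * current_setting('cpu_tuple_cost')::numeric
     + reltuples::numeric * current_setting('cpu_operator_cost')::numeric AS est_cost
FROM pg_class
WHERE relname = 't';
-- For 443 pages and 100000 tuples this gives 443 + 1000 + 250 = 1693,
-- matching the Seq Scan estimate shown earlier.
```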

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%B8%80%E4%B8%AA-query-%E5%A6%82%E4%BD%95%E6%89%A7%E8%A1%8C%E5%92%8C%E4%BC%A0%E9%80%92%E6%88%90%E6%9C%AC How a QUERY executes and passes costs along

Once the execution plan has been generated, the QUERY is executed by walking the execution tree.


The execution tree consists of a number of nodes; execution moves from one node to the next, like a relay race.

What is passed between nodes?

The Path data structure, which mainly contains (rows, startup_cost, total_cost):

rows: how many rows satisfying the conditions this node outputs to the next node.

startup_cost: the cost for this node to produce its first qualifying record.

total_cost: the cost for this node to produce all qualifying records.
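These three values are exactly what EXPLAIN prints for every node, as cost=startup_cost..total_cost together with rows. A representative sketch, again on the hypothetical table t:

```sql
EXPLAIN SELECT * FROM t ORDER BY id;
--                            QUERY PLAN
-- ------------------------------------------------------------------
--  Sort  (cost=9747.82..9997.82 rows=100000 width=6)
--    Sort Key: id
--    ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=6)
-- The Sort node's startup_cost is high because it must consume its
-- entire input before it can emit the first (smallest) row; the Seq
-- Scan node, by contrast, has a startup_cost of 0.
```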

What kinds of execution nodes are there?

There are many kinds of execution nodes; they can all be found in the cost-calculation code:

src/backend/optimizer/path/costsize.c

/*  
 * cost_seqscan  
 *        Determines and returns the cost of scanning a relation sequentially.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)  
  
/*  
 * cost_samplescan  
 *        Determines and returns the cost of scanning a relation using sampling.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)  
  
/*  
 * cost_gather  
 *        Determines and returns the cost of gather path.  
 *  
 * 'rel' is the relation to be operated upon  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 * 'rows' may be used to point to a row estimate; if non-NULL, it overrides  
 * both 'rel' and 'param_info'.  This is useful when the path doesn't exactly  
 * correspond to any particular RelOptInfo.  
 */  
cost_gather(GatherPath *path, PlannerInfo *root, RelOptInfo *rel,   
ParamPathInfo *param_info, double *rows)  
  
/*  
 * cost_gather_merge  
 *        Determines and returns the cost of gather merge path.  
 *  
 * GatherMerge merges several pre-sorted input streams, using a heap that at  
 * any given instant holds the next tuple from each stream. If there are N  
 * streams, we need about N*log2(N) tuple comparisons to construct the heap at  
 * startup, and then for each output tuple, about log2(N) comparisons to  
 * replace the top heap entry with the next tuple from the same stream.  
 */  
cost_gather_merge(GatherMergePath *path, PlannerInfo *root, RelOptInfo *rel,   
ParamPathInfo *param_info, Cost input_startup_cost, Cost input_total_cost, double *rows)  
  
  
/*  
 * cost_index  
 *        Determines and returns the cost of scanning a relation using an index.  
 *  
 * 'path' describes the indexscan under consideration, and is complete  
 *              except for the fields to be set by this routine  
 * 'loop_count' is the number of repetitions of the indexscan to factor into  
 *              estimates of caching behavior  
 *  
 * In addition to rows, startup_cost and total_cost, cost_index() sets the  
 * path's indextotalcost and indexselectivity fields.  These values will be  
 * needed if the IndexPath is used in a BitmapIndexScan.  
 *  
 * NOTE: path->indexquals must contain only clauses usable as index  
 * restrictions.  Any additional quals evaluated as qpquals may reduce the  
 * number of returned tuples, but they won't reduce the number of tuples  
 * we have to fetch from the table, so they don't reduce the scan cost.  
 */  
cost_index(IndexPath *path, PlannerInfo *root, double loop_count, bool partial_path)  
  
/*  
 * cost_bitmap_heap_scan  
 *        Determines and returns the cost of scanning a relation using a bitmap  
 *        index-then-heap plan.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 * 'bitmapqual' is a tree of IndexPaths, BitmapAndPaths, and BitmapOrPaths  
 * 'loop_count' is the number of repetitions of the indexscan to factor into  
 *              estimates of caching behavior  
 *  
 * Note: the component IndexPaths in bitmapqual should have been costed  
 * using the same loop_count.  
 */  
cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,   
ParamPathInfo *param_info, Path *bitmapqual, double loop_count)  
  
/*  
 * cost_bitmap_tree_node  
 *              Extract cost and selectivity from a bitmap tree node (index/and/or)  
 */  
cost_bitmap_tree_node(Path *path, Cost *cost, Selectivity *selec)  
  
  
/*  
 * cost_bitmap_and_node  
 *              Estimate the cost of a BitmapAnd node  
 *  
 * Note that this considers only the costs of index scanning and bitmap  
 * creation, not the eventual heap access.  In that sense the object isn't  
 * truly a Path, but it has enough path-like properties (costs in particular)  
 * to warrant treating it as one.  We don't bother to set the path rows field,  
 * however.  
 */  
cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root)  
  
  
/*  
 * cost_bitmap_or_node  
 *              Estimate the cost of a BitmapOr node  
 *  
 * See comments for cost_bitmap_and_node.  
 */  
cost_bitmap_or_node(BitmapOrPath *path, PlannerInfo *root)  
  
  
/*  
 * cost_tidscan  
 *        Determines and returns the cost of scanning a relation using TIDs.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'tidquals' is the list of TID-checkable quals  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_tidscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,   
List *tidquals, ParamPathInfo *param_info)  
  
  
/*  
 * cost_subqueryscan  
 *        Determines and returns the cost of scanning a subquery RTE.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,   
RelOptInfo *baserel, ParamPathInfo *param_info)  
  
  
/*  
 * cost_functionscan  
 *        Determines and returns the cost of scanning a function RTE.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_functionscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)  
  
  
/*  
 * cost_tablefuncscan  
 *        Determines and returns the cost of scanning a table function.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_tablefuncscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)  
  
  
/*  
 * cost_valuesscan  
 *        Determines and returns the cost of scanning a VALUES RTE.  
 *  
 * 'baserel' is the relation to be scanned  
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL  
 */  
cost_valuesscan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)  
  
  
/*  
 * cost_ctescan  
 *        Determines and returns the cost of scanning a CTE RTE.  
 *  
 * Note: this is used for both self-reference and regular CTEs; the  
 * possible cost differences are below the threshold of what we could  
 * estimate accurately anyway.  Note that the costs of evaluating the  
 * referenced CTE query are added into the final plan as initplan costs,  
 * and should NOT be counted here.  
 */  
cost_ctescan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info)  
cost_namedtuplestorescan(Path *path, PlannerInfo *root,   
RelOptInfo *baserel, ParamPathInfo *param_info)  
  
  
/*  
 * cost_recursive_union  
 *        Determines and returns the cost of performing a recursive union,  
 *        and also the estimated output size.  
 *  
 * We are given Paths for the nonrecursive and recursive terms.  
 */  
cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)  
  
  
/*  
 * cost_sort  
 *        Determines and returns the cost of sorting a relation, including  
 *        the cost of reading the input data.  
 *  
 * If the total volume of data to sort is less than sort_mem, we will do  
 * an in-memory sort, which requires no I/O and about t*log2(t) tuple  
 * comparisons for t tuples.  
 *  
 * If the total volume exceeds sort_mem, we switch to a tape-style merge  
 * algorithm.  There will still be about t*log2(t) tuple comparisons in  
 * total, but we will also need to write and read each tuple once per  
 * merge pass.  We expect about ceil(logM(r)) merge passes where r is the  
 * number of initial runs formed and M is the merge order used by tuplesort.c.  
 * Since the average initial run should be about sort_mem, we have  
 *              disk traffic = 2 * relsize * ceil(logM(p / sort_mem))  
 *              cpu = comparison_cost * t * log2(t)  
 *  
 * If the sort is bounded (i.e., only the first k result tuples are needed)  
 * and k tuples can fit into sort_mem, we use a heap method that keeps only  
 * k tuples in the heap; this will require about t*log2(k) tuple comparisons.  
 *  
 * The disk traffic is assumed to be 3/4ths sequential and 1/4th random  
 * accesses (XXX can't we refine that guess?)  
 *  
 * By default, we charge two operator evals per tuple comparison, which should  
 * be in the right ballpark in most cases.  The caller can tweak this by  
 * specifying nonzero comparison_cost; typically that's used for any extra  
 * work that has to be done to prepare the inputs to the comparison operators.  
 *  
 * 'pathkeys' is a list of sort keys  
 * 'input_cost' is the total cost for reading the input data  
 * 'tuples' is the number of tuples in the relation  
 * 'width' is the average tuple width in bytes  
 * 'comparison_cost' is the extra cost per comparison, if any  
 * 'sort_mem' is the number of kilobytes of work memory allowed for the sort  
 * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound  
 *  
 * NOTE: some callers currently pass NIL for pathkeys because they  
 * can't conveniently supply the sort keys.  Since this routine doesn't  
 * currently do anything with pathkeys anyway, that doesn't matter...  
 * but if it ever does, it should react gracefully to lack of key data.  
 * (Actually, the thing we'd most likely be interested in is just the number  
 * of sort keys, which all callers *could* supply.)  
 */  
cost_sort(Path *path, PlannerInfo *root, List *pathkeys,   
Cost input_cost, double tuples, int width, Cost comparison_cost, int sort_mem, double limit_tuples)  
  
  
/*  
 * cost_append  
 *        Determines and returns the cost of an Append node.  
 *  
 * We charge nothing extra for the Append itself, which perhaps is too  
 * optimistic, but since it doesn't do any selection or projection, it is a  
 * pretty cheap node.  
 */  
cost_append(Path *path, List *subpaths, int num_nonpartial_subpaths)  
  
/*  
 * cost_merge_append  
 *        Determines and returns the cost of a MergeAppend node.  
 *  
 * MergeAppend merges several pre-sorted input streams, using a heap that  
 * at any given instant holds the next tuple from each stream.  If there  
 * are N streams, we need about N*log2(N) tuple comparisons to construct  
 * the heap at startup, and then for each output tuple, about log2(N)  
 * comparisons to replace the top entry.  
 *  
 * (The effective value of N will drop once some of the input streams are  
 * exhausted, but it seems unlikely to be worth trying to account for that.)  
 *  
 * The heap is never spilled to disk, since we assume N is not very large.  
 * So this is much simpler than cost_sort.  
 *  
 * As in cost_sort, we charge two operator evals per tuple comparison.  
 *  
 * 'pathkeys' is a list of sort keys  
 * 'n_streams' is the number of input streams  
 * 'input_startup_cost' is the sum of the input streams' startup costs  
 * 'input_total_cost' is the sum of the input streams' total costs  
 * 'tuples' is the number of tuples in all the streams  
 */  
cost_merge_append(Path *path, PlannerInfo *root, List *pathkeys,   
int n_streams, Cost input_startup_cost, Cost input_total_cost, double tuples)  
  
  
/*  
 * cost_material  
 *        Determines and returns the cost of materializing a relation, including  
 *        the cost of reading the input data.  
 *  
 * If the total volume of data to materialize exceeds work_mem, we will need  
 * to write it to disk, so the cost is much higher in that case.  
 *  
 * Note that here we are estimating the costs for the first scan of the  
 * relation, so the materialization is all overhead --- any savings will  
 * occur only on rescan, which is estimated in cost_rescan.  
 */  
cost_material(Path *path, Cost input_startup_cost,   
Cost input_total_cost, double tuples, int width)  
  
/*  
 * cost_agg  
 *              Determines and returns the cost of performing an Agg plan node,  
 *              including the cost of its input.  
 *  
 * aggcosts can be NULL when there are no actual aggregate functions (i.e.,  
 * we are using a hashed Agg node just to do grouping).  
 *  
 * Note: when aggstrategy == AGG_SORTED, caller must ensure that input costs  
 * are for appropriately-sorted input.  
 */  
cost_agg(Path *path, PlannerInfo *root, AggStrategy aggstrategy,   
const AggClauseCosts *aggcosts, int numGroupCols, double numGroups, Cost input_startup_cost, Cost input_total_cost, double input_tuples)  
  
  
/*  
 * cost_windowagg  
 *              Determines and returns the cost of performing a WindowAgg plan node,  
 *              including the cost of its input.  
 *  
 * Input is assumed already properly sorted.  
 */  
cost_windowagg(Path *path, PlannerInfo *root, List *windowFuncs,   
int numPartCols, int numOrderCols, Cost input_startup_cost, Cost input_total_cost, double input_tuples)  
  
  
/*  
 * cost_group  
 *              Determines and returns the cost of performing a Group plan node,  
 *              including the cost of its input.  
 *  
 * Note: caller must ensure that input costs are for appropriately-sorted  
 * input.  
 */  
cost_group(Path *path, PlannerInfo *root, int numGroupCols, double numGroups,  
                   Cost input_startup_cost, Cost input_total_cost,  
                   double input_tuples)  
  
  
/*  
 * cost_subplan  
 *              Figure the costs for a SubPlan (or initplan).  
 *  
 * Note: we could dig the subplan's Plan out of the root list, but in practice  
 * all callers have it handy already, so we make them pass it.  
 */  
cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)  
  
  
/*  
 * cost_rescan  
 *              Given a finished Path, estimate the costs of rescanning it after  
 *              having done so the first time.  For some Path types a rescan is  
 *              cheaper than an original scan (if no parameters change), and this  
 *              function embodies knowledge about that.  The default is to return  
 *              the same costs stored in the Path.  (Note that the cost estimates  
 *              actually stored in Paths are always for first scans.)  
 *  
 * This function is not currently intended to model effects such as rescans  
 * being cheaper due to disk block caching; what we are concerned with is  
 * plan types wherein the executor caches results explicitly, or doesn't  
 * redo startup calculations, etc.  
 */  
cost_rescan(PlannerInfo *root, Path *path, Cost *rescan_startup_cost,      /* output parameters */  
                        Cost *rescan_total_cost)  
  
  
/*  
 * cost_qual_eval  
 *              Estimate the CPU costs of evaluating a WHERE clause.  
 *              The input can be either an implicitly-ANDed list of boolean  
 *              expressions, or a list of RestrictInfo nodes.  (The latter is  
 *              preferred since it allows caching of the results.)  
 *              The result includes both a one-time (startup) component,  
 *              and a per-evaluation component.  
 */  
cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root)  
  
  
/*  
 * cost_qual_eval_node  
 *              As above, for a single RestrictInfo or expression.  
 */  
cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root)  
  
  
  
cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)        

How to estimate the cost of each node

Remember the relay baton mentioned earlier? The baton contains rows, and that is crucial.

rows tells the next node: you will probably have to process this many rows.

But rows alone is not enough; there are also cost factors, because each row may additionally involve operator evaluation, parallel-worker overhead, and so on.

These factors are defined as follows:

 *      seq_page_cost           Cost of a sequential page fetch  
 *      random_page_cost        Cost of a non-sequential page fetch  
 *      cpu_tuple_cost          Cost of typical CPU time to process a tuple  
 *      cpu_index_tuple_cost    Cost of typical CPU time to process an index tuple  
 *      cpu_operator_cost       Cost of CPU time to execute an operator or function  
 *      parallel_tuple_cost     Cost of CPU time to pass a tuple from worker to master backend  
 *      parallel_setup_cost     Cost of setting up shared memory for parallelism        
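These factors are ordinary configuration parameters, so they can be inspected and overridden per session; a minimal sketch (the values in the comments are the stock PostgreSQL defaults):

```sql
SHOW seq_page_cost;          -- default 1.0
SHOW random_page_cost;       -- default 4.0
SHOW cpu_tuple_cost;         -- default 0.01
SHOW cpu_index_tuple_cost;   -- default 0.005
SHOW cpu_operator_cost;      -- default 0.0025
SHOW parallel_tuple_cost;    -- default 0.1
SHOW parallel_setup_cost;    -- default 1000

-- Example override: random I/O on SSDs is much cheaper than the
-- default assumes, so a lower random_page_cost is common
SET random_page_cost = 1.1;
```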

Some optimizer cost-estimation examples can be found in the documentation:

https://www.postgresql.org/docs/10/static/planner-stats-details.html

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%BB%8E%E6%88%90%E6%9C%AC%E5%A6%82%E4%BD%95%E5%BE%97%E5%88%B0%E6%89%A7%E8%A1%8C%E6%97%B6%E9%97%B4 How to get execution time from cost

Note that cost is an abstract quantity with no inherent link to time, but we can tie the two together.

This requires calibration: tuning the cost factors so that the reported cost equals the execution time.

I described how to calibrate in previously published articles; please refer to:

《Optimizer cost factor calibration - PostgreSQL explain cost constants alignment to timestamp》 《A PostgreSQL 10 hidden gem - custom statistics》
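In outline, the calibration loop looks like the following sketch; the fitting procedure itself is covered in the first article above, and the output shown is representative, using the hypothetical table t:

```sql
-- 1. Run representative queries with EXPLAIN ANALYZE and record the
--    pairs (estimated cost, actual milliseconds).
EXPLAIN ANALYZE SELECT count(*) FROM t;
--  Aggregate  (cost=1693.00..1693.01 rows=1 width=8)
--             (actual time=25.3..25.3 rows=1 loops=1)
-- 2. Repeat with queries that stress different factors (sequential
--    scans, index scans, CPU-heavy expressions).
-- 3. Adjust seq_page_cost, cpu_tuple_cost, etc. until the reported
--    cost tracks the observed time, i.e. one cost unit ≈ 1 ms.
```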

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E5%A6%82%E4%BD%95%E5%9C%A8%E7%94%A8%E6%88%B7%E6%B2%A1%E6%9C%89%E6%95%B0%E6%8D%AE%E7%9A%84%E6%83%85%E5%86%B5%E4%B8%8B%E4%BC%B0%E7%AE%97%E6%80%A7%E8%83%BD How to estimate performance when the user has no data yet

The method is actually quite simple; we only need the business side to provide a few things:

1. The table definitions

2. The SQL to be evaluated

3. The statistics — the columns annotated in the Notes column below are the parts that must be provided.

View "pg_catalog.pg_stats"  
         Column         |   Type   | Collation | Nullable | Notes  
------------------------+----------+-----------+----------+---------------------------------------------------------------  
 schemaname             | name     |           |          |  
 tablename              | name     |           |          |  
 attname                | name     |           |          |  
 inherited              | boolean  |           |          |  
 null_frac              | real     |           |          | fraction of entries that are NULL  
 avg_width              | integer  |           |          | average entry width, in bytes  
 n_distinct             | real     |           |          | number of distinct values, or ratio if negative; -1 means unique  
 most_common_vals       | anyarray |           |          | most common values (MCVs)  
 most_common_freqs      | real[]   |           |          | frequencies of the most common values  
 histogram_bounds       | anyarray |           |          | bounds splitting the values into equal-population buckets  
 correlation            | real     |           |          | correlation between physical row order and column value order  
 most_common_elems      | anyarray |           |          | for multi-valued types (e.g. arrays): most common elements  
 most_common_elem_freqs | real[]   |           |          | frequencies of the most common elements  
 elem_count_histogram   | real[]   |           |          | histogram of distinct-element counts per row  
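For example, the relevant statistics for a single table can be pulled with an ordinary query (the table name t is again hypothetical):

```sql
SELECT attname, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 't';
```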

Because pg_stats supports export and import, all of this can be done without the actual data; the Postgres Pro distribution provides exactly this capability.

https://postgrespro.com/docs/postgresproee/9.6/dump-stat.html

4. Calibrated cost factors

 *      seq_page_cost           Cost of a sequential page fetch  
 *      random_page_cost        Cost of a non-sequential page fetch  
 *      cpu_tuple_cost          Cost of typical CPU time to process a tuple  
 *      cpu_index_tuple_cost    Cost of typical CPU time to process an index tuple  
 *      cpu_operator_cost       Cost of CPU time to execute an operator or function  
 *      parallel_tuple_cost     Cost of CPU time to pass a tuple from worker to master backend  
 *      parallel_setup_cost     Cost of setting up shared memory for parallelism  
int                     effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;        

With the above elements, we can obtain an estimated execution time for each SQL statement via EXPLAIN.

From that we can derive TPS and so on.

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%BB%8E%E6%89%A7%E8%A1%8C%E6%97%B6%E9%97%B4%E5%A6%82%E4%BD%95%E5%BE%97%E5%88%B0tps How to get TPS from execution time

There are several cases; a worked example follows the list.

1. When CPU is the bottleneck: TPS = number of cores * (1 second / execution time).

2. When I/O is the bottleneck: TPS = (disk bandwidth or IOPS capacity) / (read-write throughput or I/O per query).
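A quick worked example with assumed numbers (all figures hypothetical):

```
CPU-bound : 32 cores, 2 ms per query
            TPS = 32 * (1000 ms / 2 ms) = 16000

I/O-bound : disk sustains 20000 IOPS, 8 random reads per query
            TPS = 20000 / 8 = 2500

The achievable TPS is the lower of the two bounds, here ~2500.
```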

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%BA%8C%E5%AE%8F%E8%A7%82%E4%BC%B0%E7%AE%97%E6%B3%95 2. The Macro Estimation Method

Macro estimation works from the characteristics of the product itself.

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#greenplum%E5%92%8Cpostgresql%E4%B8%A4%E4%B8%AA%E4%BA%A7%E5%93%81%E7%9A%84%E7%89%B9%E8%89%B2 Characteristics of the two products Greenplum and PostgreSQL

1. RDS PostgreSQL 10 suits scenarios of roughly 10 TB to 100 TB that are primarily OLTP with OLAP as a complement, very similar to the scenarios Oracle covers.

SQL:2011-compatible; million-plus tpmC.

Supports multi-core parallel computing.

Supports readable and writable foreign tables on OSS object storage.

Supports common and extended data types: JSON(B), hstore (KV), PostGIS spatial types, pgRouting (routing, graph search), arrays, the ltree tree type, the HLL approximate-counting type, smlar, imgsmlr, and more.

Supports a SQL stream-computing plugin.

Supports a time-series plugin.

Supports btree, hash, gin, gist, sp-gist, bloom, brin, and other index types.

Supports server-side programming in PL/pgSQL and SQL.

Supports analytical syntax (multidimensional aggregation, window queries) and recursive queries (tree traversal, graph search, and similar scenarios). Supports full-text search, fuzzy matching, similarity queries, and regular-expression queries. Supports array-similarity and image-similarity queries.

1.1 Suitable business scenarios:

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E4%B8%89%E7%B2%BE%E5%87%86%E5%AE%9E%E6%B5%8B%E6%B3%95 3. The Precise Measurement Method

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E5%8E%8B%E6%B5%8B%E6%96%B9%E6%B3%95 Stress-testing method

The two documents linked above design a scenario, covering how to design the scenario, the schema, the queries, and the test scripts, and how to run the stress test.

Below is one more example.

1. Design the table structure

create table a(id int8 primary key, info text, crt_time timestamp);        

2. Design the SQL

insert into a values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info, crt_time=excluded.crt_time;        

3. Design the test script

pgbench supports several random-number generation methods, supports sleep commands for simulating client-side business-logic processing, and supports multiple threads. See the pgbench documentation for details.

vi test.sql  
  
\set id random(1,100000000)  
insert into a values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info, crt_time=excluded.crt_time;        

4. Run the stress test (choosing connection count and duration)

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120  
  
transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 32  
number of threads: 32  
duration: 120 s  
number of transactions actually processed: 37100343  
latency average = 0.103 ms  
latency stddev = 0.282 ms  
tps = 309166.975398 (including connections establishing)  
tps = 309180.511436 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.001  \set id random(1,100000000)  
         0.103  insert into a values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info, crt_time=excluded.crt_time;        

Documentation for pgbench, the PostgreSQL benchmarking client:

https://www.postgresql.org/docs/9.6/static/pgbench.html

Industry-standard benchmarks

1. TPC-B

The test model built into the PostgreSQL pgbench client is precisely TPC-B(-like). See the pgbench documentation; it is straightforward. A minimal run is sketched below.
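A minimal sketch of a run (scale factor, client and thread counts, and duration are arbitrary example values):

```
# initialize the TPC-B-like schema at scale factor 100
pgbench -i -s 100 postgres

# run the built-in TPC-B-like script: 32 clients, 32 threads, 120 seconds
pgbench -M prepared -c 32 -j 32 -T 120 postgres
```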

2. pgbench for sysbench

This tests some scenarios popular in the MySQL world:

《Using pgbench to test sysbench-related cases on PostgreSQL》

3. TPC-C

TPC-C is the industry-standard OLTP benchmark and involves fairly complex queries.

《The database world's grand tournament - tpc.org》

4. LinkBench

LinkBench is a benchmark model from Facebook for testing graph-related writes and queries:

《Testing PostgreSQL social-graph performance with Facebook LinkBench》

4. Some common performance metrics

https://github.com/digoal/blog/blob/master/201203/20120313_01.md https://github.com/digoal/blog/blob/master/201203/20120313_02.md https://github.com/digoal/blog/blob/master/201711/readme.md

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#%E5%B0%8F%E7%BB%93 Summary

As the business grows, you need to estimate database performance and how much hardware to invest in; this article has presented three methods.

Micro estimation: once the business is built, the table structures and queries are fixed; the only thing that varies is the data. The data can be estimated by the business side: how many records, how many distinct values, what the correlation is, what the most common values look like, and so on.

Combining the calibrated cost factors, the statistics, the schema, and the queries then yields an execution time for each QUERY, from which you can assess how much hardware is needed to reach the target TPS.

This estimate can be made early in business development, and it is relatively accurate.

Macro estimation suits product selection, since it is merely a summary of the characteristics of the various products.

Precise measurement: based on the table structures and business logic, design test scripts, then evaluate using the actual test results combined with the expected growth of the business.

Finally, this article also provided several industry-standard benchmark methods, as well as some existing benchmark results, for reference.

https://github.com/digoal/blog/blob/master/201709/20170921_01.md#postgresql-%E8%AE%B8%E6%84%BF%E9%93%BE%E6%8E%A5 PostgreSQL wish link: https://github.com/digoal/blog/issues/76