天天看點

MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:

MySQL的總體架構

通常我們認為MySQL的整體架構如下,

MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:

官方10年前開始就一直在緻力于優化器代碼的重構工作,目的是能確定在SQL的執行過程中有清晰的階段,包括分離Parse和Resolve階段、更好的支援更多新的文法、保證Name和Type的解析、盡量在Prepare階段做完Transformations,這樣能夠很容易的支援CTEs、Windows函數、LATERAL文法、JSON的Table函數和Windows函數。當然,最重要的重構當數Iterator執行器。這個早期MySQL版本的優化器執行器邏輯:

MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:

不過本文還要普及一些MySQL基礎的知識。

MySQL的AST枝幹(基于8.0.20)

首先我們先了解Parse後的AST樹到底是長什麼樣子?先了解重要的兩個結構: SELECT_LEX & SELECT_LEX_UNIT

SELECT_LEX: 代表了SELECT本身,也就是SQL SELECT的關鍵字就會對應一個SELECT_LEX的結構。

SELECT_LEX_UNIT: 代表了帶UNION的一堆SELECT,當然也可以代表一個SELECT。

下面這段是SELECT_LEX_UNIT/SELECT_LEX的類結構中,和查詢層次相關的一些成員變量

class SELECT_LEX_UNIT {
  /**
    Intrusive double-linked list of all query expressions
    immediately contained within the same query block.
  */
  SELECT_LEX_UNIT *next;
  SELECT_LEX_UNIT **prev;
  /**
    The query block wherein this query expression is contained,
    NULL if the query block is the outer-most one.
  */
  SELECT_LEX *master;
  /// The first query block in this query expression.
  SELECT_LEX *slave;
  ......
  /**
    Helper query block for query expression with UNION or multi-level
    ORDER BY/LIMIT
  */
  SELECT_LEX *fake_select_lex
  ......
    /// @return the query block this query expression belongs to as subquery
  SELECT_LEX *outer_select() const { return master; }
  /// @return the first query block inside this query expression
  SELECT_LEX *first_select() const { return slave; }
  /// @return the next query expression within same query block (next subquery)
  SELECT_LEX_UNIT *next_unit() const { return next; }
  ......
}
class SELECT_LEX {
 ......
 private:
  /**
    Intrusive double-linked list of all query blocks within the same
    query expression.
  */
  SELECT_LEX *next;
  SELECT_LEX **prev;
  /// The query expression containing this query block.
  SELECT_LEX_UNIT *master;
  /// The first query expression contained within this query block.
  SELECT_LEX_UNIT *slave;
  /// Intrusive double-linked global list of query blocks.
  SELECT_LEX *link_next;
  SELECT_LEX **link_prev;
  ......
  SELECT_LEX_UNIT *master_unit() const { return master; }
  void set_master(SELECT_LEX_UNIT *src) {  master = src; }
  SELECT_LEX_UNIT *first_inner_unit() const { return slave; }
  SELECT_LEX *outer_select() const { return master->outer_select(); }
  SELECT_LEX *next_select() const { return next; }
  ......
}           

我們來拿一個真實的例子來舉例說明:

(SELECT *
   FROM ttt1)
UNION ALL
  (SELECT *
   FROM
     (SELECT *
      FROM ttt2) AS a,
     (SELECT *
      FROM ttt3
      UNION ALL SELECT *
      FROM ttt4) AS b)           

實際中該查詢的标準記憶體結構就是這樣的,

MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:

這裡需要注意的是,MySQL官方的 [WL#5275: Process subqueries in FROM clause in the same way as]

https://dev.mysql.com/worklog/task/?id=5275> view支援可以把子查詢提升到上層查詢中。優化器會調用SELECT_LEX::resolve_placeholder_tables -

SELECT_LEX::merge_derived來避免materialize那些可以提升到上層查詢的子查詢。外部也可以通過set optimizer_switch='derived_merge=on/off'來進行開關,下面來對比下8.0.13和8.0.18對于該優化的執行計劃展現。

8.0.13 derived_merge off vs on
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+---------------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                 |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+---------------------------------------+
|  1 | PRIMARY     | ttt1       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
|  2 | UNION       | <derived3> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2 |   100.00 | NULL                                  |
|  2 | UNION       | <derived4> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |   100.00 | Using join buffer (Block Nested Loop) |
|  4 | DERIVED     | ttt3       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
|  5 | UNION       | ttt4       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
|  3 | DERIVED     | ttt2       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+---------------------------------------+
6 rows in set, 1 warning (0.01 sec)
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+---------------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                 |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+---------------------------------------+
|  1 | PRIMARY     | ttt1       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
|  2 | UNION       | ttt2       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
|  2 | UNION       | <derived4> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |   100.00 | Using join buffer (Block Nested Loop) |
|  4 | DERIVED     | ttt3       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
|  5 | UNION       | ttt4       | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                  |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+---------------------------------------+
5 rows in set, 1 warning (0.02 sec)
8.0.18 derived_merge off vs on
| -> Append
    -> Stream results
        -> Table scan on ttt1  (cost=0.35 rows=1)
    -> Stream results
        -> Inner hash join
            -> Table scan on b
                -> Union materialize
                    -> Table scan on ttt3  (cost=0.35 rows=1)
                    -> Table scan on ttt4  (cost=0.35 rows=1)
            -> Hash
                -> Table scan on a
                    -> Materialize
                        -> Table scan on ttt2  (cost=0.35 rows=1)
 
 | -> Append
    -> Stream results
        -> Table scan on ttt1  (cost=0.35 rows=1)
    -> Stream results
        -> Inner hash join
            -> Table scan on b
                -> Union materialize
                    -> Table scan on ttt3  (cost=0.35 rows=1)
                    -> Table scan on ttt4  (cost=0.35 rows=1)
            -> Hash
                -> Table scan on ttt2  (cost=0.35 rows=1)           

通過優化後,該查詢的記憶體結構就變成這樣了

MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:

MySQL的執行流程對比

本文由于不是介紹整個優化器的詳細優化過程,是以我們這裡簡單介紹下優化器的一些步驟和方法,根據MySQL官方網站的介紹我們可以知道具體步驟如下:

handle_select()
   mysql_select()
     JOIN::prepare()
       setup_fields()
     JOIN::optimize()            /* optimizer is from here ... */
       optimize_cond()
       opt_sum_query()
       make_join_statistics()
         get_quick_record_count()
         choose_plan()
           /* Find the best way to access tables */
           /* as specified by the user.          */
           optimize_straight_join()
             best_access_path()
           /* Find a (sub-)optimal plan among all or subset */
           /* of all possible query plans where the user    */
           /* controls the exhaustiveness of the search.   */
           greedy_search()
             best_extension_by_limited_search()
               best_access_path()
           /* Perform an exhaustive search for an optimal plan */
           find_best()
       make_join_select()        /* ... to here */
     JOIN::exec()           

不過這個文檔比較老,沒有來的及更新,其中JOIN::prepare()函數已經在

WL#7082 - Move permanent transformations from JOIN::optimize () to JOIN::prepare (). As one single patch.

增加了SELECT_LEX::prepare代替JOIN::prepare.

另外從8.0.1對整個DML的操作進行了重構,讓結構更加清晰,讓Prepare和Execute過程清楚的分開,這裡隻列出一些和查詢相關的部分,參考

WL#5094: Create SQL command classes for DML statements
MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:
Class Sql_cmd_dml
  Sql_cmd_dml::prepare() walks through these common steps:
  precheck() - performs a coarse-grained authorization check of the statement.
  open_tables_for_query() - opens the tables and views mentioned in the statement. Views are expanded so that underlying views and tables are opened too.
  resolve_var_assignments() - resolves variable assignments in the statement.
  prepare_inner() - performs statement-specific preparation of the statement and is implemented for every subclass of Sql_cmd_dml.
  Sql_cmd_dml::execute() walks through these common steps:
  set_statement_timer() - is called if a time limit is applicable to the statement.
  prepare() is called if the statement is a regular (not preparable) statement.
  If prepare() is not called, precheck() and open_tables_for_query() are still called since these actions are required also when executing already prepared statements.
  run_before_dml_hook() is called if the statement is a data change statement, in order to prepare replication actions for the statement.
  An IGNORE or STRICT mode error handler is set up if applicable. It will be active for the duration of the execution of the statement.
  lock_tables() is called, unless the statement affects no rows or produces no rows in any tables.
  Query_cache::store_query() is called to register the statement in the query cache, if applicable.
  execute_inner() - performs statement-specific optimization and execution of the statement. Sql_cmd_dml::execute_inner() is an implementation for all SELECT statements, all INSERT statements that are based on a SELECT and all multi-table UPDATE and DELETE statements (ie all statements that are implemented using a JOIN object). For all other types of DML statements, a separate implementation for execute_inner() exists.
Class Sql_cmd_select
  This is a new class used to implement SELECT statements.
  It has an implementation of prepare_inner() to prepare SELECT statements.
  It uses Sql_cmd_dml::execute_inner() to execute SELECT statements.           

Sql_cmd_dml是LEX的成員變量m_sql_cmd,而lex->m_sql_cmd大部分會在sql/sql_yacc.yy中new出來,是以目前8.0.13版本整個的流程就變成了下面的流程

8.0.13
mysql_execute_command()
  lex->m_sql_cmd->execute()
  Sql_cmd_dml::execute()
    Sql_cmd_dml::prepare()
      Sql_cmd_select::precheck()
      Sql_cmd_select::open_tables_for_query()
      Sql_cmd_select::prepare_inner()
        SELECT_LEX_UNIT::prepare_limit()
        SELECT_LEX_UNIT::prepare() (not simple or simple SELECT_LEX::prepare)
          SELECT_LEX::prepare()
            ......
      Sql_cmd_dml::execute_inner
        SELECT_LEX_UNIT::optimize() (not simple or simple SELECT_LEX::optimize)
          SELECT_LEX::optimize()  
            JOIN::optimize()
            SELECT_LEX_UNIT::optimize()
              ......
        SELECT_LEX_UNIT::execute() (not simple or simple SELECT_LEX::optimize)
          SELECT_LEX::execute()  
            JOIN::exec()
              JOIN::prepare_result()
              do_select()
                sub_select()
                  ......
            SELECT_LEX_UNIT::execute()
              ......
  SELECT_LEX_UNIT::cleanup(false)                  

打開set debug="+d,info,error,query,enter,general,where:O,/tmp/mysqld.trace"可以看到更詳細的執行步驟

T@8: | | | | | | | >do_select
T@8: | | | | | | | | >sub_select
T@8: | | | | | | | | | >innobase_trx_init
T@8: | | | | | | | | | <innobase_trx_init 3269
T@8: | | | | | | | | | >handler::ha_index_init
T@8: | | | | | | | | | | >index_init
T@8: | | | | | | | | | | <index_init 10243
T@8: | | | | | | | | | | >change_active_index
T@8: | | | | | | | | | | | >innobase_get_index
T@8: | | | | | | | | | | | <innobase_get_index 11071
T@8: | | | | | | | | | | <change_active_index 11172
T@8: | | | | | | | | | <handler::ha_index_init 2793
T@8: | | | | | | | | | >handler::ha_index_first
T@8: | | | | | | | | | | >index_first
T@8: | | | | | | | | | | | >index_read
T@8: | | | | | | | | | | | | >row_search_mvcc
T@8: | | | | | | | | | | | | | >row_sel_store_mysql_rec
T@8: | | | | | | | | | | | | | | >row_sel_store_mysql_field_func
T@8: | | | | | | | | | | | | | | <row_sel_store_mysql_field_func 2921
T@8: | | | | | | | | | | | | | <row_sel_store_mysql_rec 3080
T@8: | | | | | | | | | | | | <row_search_mvcc 5881
T@8: | | | | | | | | | | | <index_read 11012
T@8: | | | | | | | | | | <index_first 11308
T@8: | | | | | | | | | <handler::ha_index_first 3293
T@8: | | | | | | | | | >evaluate_join_record
T@8: | | | | | | | | | | enter: join: 0x7fff99d92d68 join_tab index: 0 table: cat cond: (nil)
T@8: | | | | | | | | | | >sub_select_op
T@8: | | | | | | | | | | <sub_select_op 1365           

這裡不在贅述這個stack是因為,我們下面要引入了我們重要的主題部分8.0.18的Iterator執行部分,看看這個與之前的執行有何不同。官方用了很多Worklogs來實作Iterator的執行器。

Plugin README
WL#12074 Volcano iterator executor base 2019-03-31 09:57:10
WL#11785 Volcano iterator design 2019-03-29 13:46:51
WL#12470 Volcano iterator semijoin 2019-06-24 14:41:06
WL#13476 BKA outer/semi/anti join in iterator executor 2020-04-08 12:13:42
WL#13002 BKA in iterator executor 2019-12-20 10:13:33
WL#13000 Iterator UNION 2019-09-16 10:57:16
WL#12788 Iterator executor analytics queries 2019-06-24 14:42:47
WL#4168 Implement EXPLAIN ANALYZE 2019-09-16 10:24:32
WL#13377 Add support for hash outer, anti and semi join 2020-04-08 12:02:15
WL#2241 Hash join 2019-09-16 10:15:21
WL#4245 Subquery optimization: Transform NOT EXISTS and NOT IN to anti-join 2019-06-24 13:12:53

MySQL的具體執行步驟對比

先來了解下術語:

QEP:全稱(Query Execution Plan)查詢執行計劃。

QEP_TAB:全稱(Query Execution Plan Table) 查詢執行計劃表

熟悉我們知道在8.0開始,官方已經慢慢的用Iterator的執行類來替換原有的一些和執行相關的類,是以原有的流程中bool JOIN::optimize(),用于生成一個Query塊的執行計劃(QEP)就增加了生成Iterator的部分。

MySQL 8.0 新的火山模型執行器MySQL的總體架構MySQL的AST枝幹(基于8.0.20)MySQL的執行流程對比MySQL的具體執行步驟對比參考資料:

最終要的JOIN::create_iterators主要分兩個步驟:

1) 通過create_table_iterators,對于每個特定表,生成對應的基本的RowIterators的繼承子類。

2) 通過調用create_root_iterator_for_join,生成組合的iterators,合并每個表的組合行。

然後将生成的iterator指派到JOIN::m_root_iterator。

表通路對比

JOIN::create_table_iterators裡面可以看到需要去輪詢所有的表來建構通路方式,調用了最重要的方法QEP_TAB::make_join_readinfo和QEP_TAB::pick_table_access_method,我們來看看和之前非Iterator通路方式有何不同。

在8.0之前,我們看到QEP_TAB是通過一些函數指針和READ_RECORD來設定通路的函數指針:

class QEP_TAB : public Sql_alloc, public QEP_shared_owner
{
  ......
  READ_RECORD::Setup_func materialize_table;
  /**
     Initialize table for reading and fetch the first row from the table. If
     table is a materialized derived one, function must materialize it with
     prepare_scan().
  */
  READ_RECORD::Setup_func read_first_record;
  Next_select_func next_select;
  READ_RECORD read_record;
  /*
    The following two fields are used for a [NOT] IN subquery if it is
    executed by an alternative full table scan when the left operand of
    the subquery predicate is evaluated to NULL.
  */
  READ_RECORD::Setup_func save_read_first_record;/* to save read_first_record */
  READ_RECORD::Read_func save_read_record;/* to save read_record.read_record */  
  ......
}
struct READ_RECORD
{
  typedef int (*Read_func)(READ_RECORD*);
  typedef void (*Unlock_row_func)(QEP_TAB *);
  typedef int (*Setup_func)(QEP_TAB*);
  
  TABLE *table;                                 /* Head-form */
  Unlock_row_func unlock_row;
  Read_func read_record;
  ......
}
bool init_read_record(READ_RECORD *info, THD *thd,
                      TABLE *table, QEP_TAB *qep_tab,
                      int use_record_cache,
                      bool print_errors, bool disable_rr_cache);
bool init_read_record_idx(READ_RECORD *info, THD *thd, TABLE *table,
                          bool print_error, uint idx, bool reverse);
void end_read_record(READ_RECORD *info);
QEP_TAB::pick_table_access_method設定流程大體如下:
void QEP_TAB::pick_table_access_method(const JOIN_TAB *join_tab) {
  ......
  switch (type())
  {
  case JT_REF:
    if (join_tab->reversed_access)
    {
      read_first_record= join_read_last_key;
      read_record.read_record= join_read_prev_same;
    }
    else
    {
      read_first_record= join_read_always_key;
      read_record.read_record= join_read_next_same;
    }
    break;
  case JT_REF_OR_NULL:
    read_first_record= join_read_always_key_or_null;
    read_record.read_record= join_read_next_same_or_null;
    break;
  case JT_CONST:
    read_first_record= join_read_const;
    read_record.read_record= join_no_more_records;
    read_record.unlock_row= join_const_unlock_row;
    break;
  ......
}           

執行的流程如下:

if (in_first_read)
    {
      in_first_read= false;
      error= (*qep_tab->read_first_record)(qep_tab); //設定合适的讀取函數,如設定索引讀函數/全表掃描函數
    }
    else
      error= info->read_record(info);           

那麼對于第一次QEP_TAB::read_first_record和後續讀指針READ_RECORD::read_record可以為下列函數的實作,其中rr代表read record:

int join_init_quick_read_record(QEP_TAB *tab);
int join_init_read_record(QEP_TAB *tab);
int join_read_first(QEP_TAB *tab);
int join_read_last(QEP_TAB *tab);
int join_read_last_key(QEP_TAB *tab);join_read_next_same
int join_materialize_derived(QEP_TAB *tab);
int join_materialize_semijoin(QEP_TAB *tab);
int join_read_prev_same(READ_RECORD *info);
static int join_read_const(QEP_TAB *tab);
static int read_const(TABLE *table, TABLE_REF *ref);
static int join_read_key(QEP_TAB *tab);
static int join_read_always_key(QEP_TAB *tab);
static int join_no_more_records(READ_RECORD *info);
static int join_read_next(READ_RECORD *info);
static int join_read_next_same(READ_RECORD *info);
static int join_read_prev(READ_RECORD *info);
static int join_ft_read_first(QEP_TAB *tab);
static int join_ft_read_next(READ_RECORD *info);
static int join_read_always_key_or_null(QEP_TAB *tab);
static int join_read_next_same_or_null(READ_RECORD *info);
int rr_sequential(READ_RECORD *info)
static int rr_quick(READ_RECORD *info);
int rr_sequential(READ_RECORD *info);
static int rr_from_tempfile(READ_RECORD *info);
template<bool> static int rr_unpack_from_tempfile(READ_RECORD *info);
template<bool> static int rr_unpack_from_buffer(READ_RECORD *info);
static int rr_from_pointers(READ_RECORD *info);
static int rr_from_cache(READ_RECORD *info);
static int init_rr_cache(THD *thd, READ_RECORD *info);
static int rr_index_first(READ_RECORD *info);
static int rr_index_last(READ_RECORD *info);
static int rr_index(READ_RECORD *info);
static int rr_index_desc(READ_RECORD *info);           

為什麼簡單的流程,需要指定不同的函數指針呢?原因是因為優化器需要根據不同的規則(RBO)和代價(CBO)去設計巧妙的通路方法,比如表掃描、索引掃描、稀疏掃描等等,那麼這樣的組合對于Innodb層提供的簡單接口來說非常複雜。Innodb層和Server層的接口也不會根據上層的變化不斷的修改和增加,是以Server層的執行層,利用自己規定的方法,來進行組合調用。比如我們舉例rr_quick函數。

static int rr_quick(READ_RECORD *info)
{
  int tmp;
  while ((tmp= info->quick->get_next()))
  {
    if (info->thd->killed || (tmp != HA_ERR_RECORD_DELETED))
    {
      tmp= rr_handle_error(info, tmp);
      break;
    }
  }
  return tmp;
}           

rr_quick增加了一個新的優化的類就是QUICK_SELECT_I接口實作的具體優化類,顧名思義就是比表掃描和索引掃描快速的通路方式,目前官方有7種方式,

enum {
    QS_TYPE_RANGE = 0,
    QS_TYPE_INDEX_MERGE = 1,
    QS_TYPE_RANGE_DESC = 2,
    QS_TYPE_FULLTEXT   = 3,
    QS_TYPE_ROR_INTERSECT = 4,
    QS_TYPE_ROR_UNION = 5,
    QS_TYPE_GROUP_MIN_MAX = 6
  };           

我們這裡隻列出大約的調用流程,而非具體每一個實作的QUICK類。

1. Create quick select
    quick= new QUICK_XXX_SELECT(...);
  2. Perform lightweight initialization. This can be done in 2 ways:
  2.a: Regular initialization
    if (quick->init())
    {
      //the only valid action after failed init() call is delete
      delete quick;
    }
  2.b: Special initialization for quick selects merged by QUICK_ROR_*_SELECT
    if (quick->init_ror_merged_scan())
      delete quick;
  3. Perform zero, one, or more scans.
    while (...)
    {
      // initialize quick select for scan. This may allocate
      // buffers and/or prefetch rows.
      if (quick->reset())
      {
        //the only valid action after failed reset() call is delete
        delete quick;
        //abort query
      }
       // perform the scan
      do
      {
        res= quick->get_next();
      } while (res && ...)
    }
  4. Delete the select:
    delete quick;                

顯然,rr_quick仍然是執行路徑分類下的又一個複雜的路由函數,根據實際READ_RECORD::quick的具體QUICK class來決定剩餘的邏輯,那如何對應到Innodb存儲的具體函數呢?拿QUICK_RANGE_SELECT這個類來舉例,參照如下調用stack:

#x  ha_index_first/ha_index_read_map or ha_index_next_same/ha_index_next
#0  handler::read_range_first or handler::read_range_next
#1  handler::multi_range_read_next (this=0x7f9a78080900, range_info=0x7f9adc38bd40)
#2  DsMrr_impl::dsmrr_next (this=0x7f9a78082628, range_info=0x7f9adc38bd40)
#3  ha_innobase::multi_range_read_next (this=0x7f9a78080900, range_info=0x7f9adc38bd40)
#4  QUICK_RANGE_SELECT::get_next (this=0x7f9a7807b220)
#5  rr_quick (info=0x7f9a78103dd8)
#6  join_init_read_record (tab=0x7f9a78103d48) 
#7  sub_select (join=0x7f9a78005bd8, join_tab=0x7f9a78103d48, end_of_records=false)
#8  do_select (join=0x7f9a78005bd8)
#9  JOIN::exec (this=0x7f9a78005bd8)            

現在回到了我們的8.0 Iterator執行器中,我們看到READ_RECORD m_read_record_info将被unique_ptr_destroy_only m_iterator所代替,包括setup_read_record(), init_read_record() and setup_read_record_idx()都将被各種各樣的Iterator代替。在Iterator的執行器下,不用關心函數指針的指派,也不需要有兩個QEP_TAB::read_first_record和後續讀指針READ_RECORD::read_record,隻需要實作RowIterator的子類并實作其定義的接口。

class RowIterator {
......
  virtual bool Init() = 0;
  virtual int Read() = 0;
  virtual void SetNullRowFlag(bool is_null_row) = 0;
  virtual void UnlockRow() = 0;
......
}           

詳細可以檢視官方的link:

https://dev.mysql.com/doc/dev/mysql-server/latest/classTableRowIterator.html https://dev.mysql.com/doc/dev/mysql-server/latest/classRowIterator.html

QEP_TAB::pick_table_access_method設定流程變為了下面的方式:

void QEP_TAB::pick_table_access_method() {
......
   switch (type()) {
     case JT_REF:
       if (is_pushed_child) {
         DBUG_ASSERT(!m_reversed_access);
         iterator = NewIterator<PushedJoinRefIterator>(
             join()->thd, table(), &ref(), use_order(), &join()->examined_rows);
       } else if (m_reversed_access) {
         iterator = NewIterator<RefIterator<true>>(join()->thd, table(), &ref(),
                                                   use_order(), this,
                                                   &join()->examined_rows);
       } else {
         iterator = NewIterator<RefIterator<false>>(join()->thd, table(), &ref(),
                                                    use_order(), this,
                                                    &join()->examined_rows);
       }
       used_ref = &ref();
       break;
     case JT_REF_OR_NULL:
       iterator = NewIterator<RefOrNullIterator>(join()->thd, table(), &ref(),
                                                 use_order(), this,
                                                 &join()->examined_rows);
       used_ref = &ref();
       break;
     case JT_CONST:
       iterator = NewIterator<ConstIterator>(join()->thd, table(), &ref(),
                                             &join()->examined_rows);
       break;
     case JT_EQ_REF:
       if (is_pushed_child) {
         iterator = NewIterator<PushedJoinRefIterator>(
             join()->thd, table(), &ref(), use_order(), &join()->examined_rows);
       } else {
         iterator = NewIterator<EQRefIterator>(
             join()->thd, table(), &ref(), use_order(), &join()->examined_rows);
       }
       used_ref = &ref();
       break;
   ......    
     case JT_ALL:
     case JT_RANGE:
     case JT_INDEX_MERGE:
       if (using_dynamic_range) {
         iterator = NewIterator<DynamicRangeIterator>(join()->thd, table(), this,
                                                      &join()->examined_rows);
       } else {
         iterator =
             create_table_iterator(join()->thd, nullptr, this, false,
                                   /*ignore_not_found_rows=*/false,
                                   &join()->examined_rows, &m_using_table_scan);
       }
       break;
   ...... 
}           

執行的流程變成了:

unique_ptr<RowIterator> iterator(new ...);
  if (iterator->Init())
    return true;
  while (iterator->Read() == 0) {
    ...
  }           

Join通路對比

MySQL 的 join 操作主要是采用NestLoop的算法,其中涉及的主要函數有如下 do_select()、sub_select()、evaluate_join_record(),當然還有BNL和BKA等等,我們就不再這裡贅述。

static int do_select(JOIN *join)
{
    ... ...
    if (join->plan_is_const() && !join->need_tmp) {
        ... ...
    } else {
      QEP_TAB *qep_tab= join->qep_tab + join->const_tables;
      DBUG_ASSERT(join->primary_tables);
      error= join->first_select(join,qep_tab,0);   ← 非結束選取
      if (error >= NESTED_LOOP_OK)
        error= join->first_select(join,qep_tab,1); ← 結束選取
    }
}
enum_nested_loop_state sub_select(JOIN *join,JOIN_TAB *join_tab,bool end_of_records)
{
  ... ...
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records); ← 一般是sub_select/最後一個是end_send/end_send_group
    DBUG_RETURN(nls);
  }
  READ_RECORD *info= &join_tab->read_record;
  ... ...
  while (rc == NESTED_LOOP_OK && join->return_tab >= qep_tab_idx)
  {
    int error;
    if (in_first_read)                         ← 讀取第一條記錄
    {
      in_first_read= false;
      error= (*qep_tab->read_first_record)(qep_tab);
    }
    else
      error= info->read_record(info);          ← 循環讀取記錄直到結束位置
......
      rc= evaluate_join_record(join, qep_tab); ← 評估是否符合條件,連接配接下一個表
  }
  ... ...
}
static enum_nested_loop_state
evaluate_join_record(JOIN *join, JOIN_TAB *join_tab)
{
    ... ...
    Item *condition= join_tab->condition();   ← 查詢條件
    bool found= TRUE;
    ... ...
    if (condition)
    {
      found= MY_TEST(condition->val_int());   ← 評估是否符合條件
    ... ...
    
    if (found)
    {
      enum enum_nested_loop_state rc;
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0);
      join->thd->get_stmt_da()->inc_current_row_for_warning();
      if (rc != NESTED_LOOP_OK)
        DBUG_RETURN(rc);
    ... ...
}           

那麼整個執行的流程串起來就是:

JOIN::exec()                              ← 執行一個Query Block
  |-THD_STAGE_INFO()                      ← 設定線程的狀态為executing
  |-set_executed()                        ← 設定為執行狀态,JOIN::executed=true
  |-prepare_result()
  |-send_result_set_metadata()            ← 先将中繼資料發送給用戶端
  |
  |-do_select()                           ←### 查詢的實際入口函數,做JOIN操作,會傳回給用戶端或寫入表
    |
    |-join->first_select(join,qep_tab,0)  ← 1. 執行nest loop操作,預設會調用sub_select()函數,
    | |                                   ←    也即循環調用rnd_next()+evaluate_join_record()
    | |
    | |###while循環讀取資料###
    | |                                   ← 2. 調用存儲引擎接口讀取資料
    | |-qep_tab->read_first_record()      ← 2.1. 首次調用,實際為join_init_read_record()
    | | |-tab->quick()->reset()           ← 對于quick調用QUICK_RANGE_SELECT::reset()函數
    | | | |-file->ha_index_init()         ← 會調用存儲引擎接口
    | | | | |-index_init()
    | | | |   |-change_active_index()
    | | | |     |-innobase_get_index()
    | | | |-file->multi_range_read_init()
    | | |-init_read_record()              ← 設定read_record指針,在此為rr_quick
    | |
    | |-info->read_record()               ← 2.2 再次調用,如上,該函數在init_read_record()中初始化
    | | |-info->quick->get_next()         ← 實際調用QUICK_RANGE_SELECT::get_next()
    | |   |-file->multi_range_read_next() ← 調用handler.cc檔案中函數
    | |     |-read_range_first()          ← 對于第一次調用
    | |     | |-ha_index_read_map()       ← 存儲引擎調用
    | |     |   |-index_read()
    | |     |     |-row_search_mvcc()
    | |     |
    | |     |-read_range_next()           ← 對于非第一次調用
    | |       |-ha_index_next()
    | |         |-general_fetch()
    | |           |-row_search_mvcc()
    | |
    | |-evaluate_join_record()            ← 2.3 處理讀取的記錄,判斷是否滿足條件,包括了第一條記錄
    |   |-qep_tab->next_select()          ← 對于查詢,實際會調用end_send()
    |     |-Query_result_send::send_data()
    |
    |-join->first_select(join,qep_tab,1)  ← 3. 一個table已經讀取資料結束,同樣預設調用sub_select()
    | |-join_tab->next_select()           ← 調用該函數處理下個表或者結束處理
    |
    |-join->select_lex->query_result()->send_eof()           

這次我們要對比下新的執行引擎的變化,既然表的通路方式已經從函數指針變為Iterator的Init/Read兩個接口,我們來看其實對于Iterator引擎更容易了解了,JOIN::create_table_iterators本身就可以構造出簡單的Iterators結構:

| -> Limit: 1 row(s)
    -> Sort: ttt1.c1, limit input to 1 row(s) per chunk  (cost=0.45 rows=2)
        -> Filter: (ttt1.c1 > 2)
            -> Table scan on ttt1           

而JOIN::create_root_iterator_for_join可以構造出更為标準的Iterator火山模型結構:

| -> Limit: 1 row(s)
    -> Sort: ttt1.c1, limit input to 1 row(s) per chunk
        -> Stream results
            -> Inner hash join (ttt1.c1 = ttt2.c1)  (cost=0.90 rows=1)
                -> Table scan on ttt1  (cost=0.45 rows=2)
                -> Hash
                    -> Filter: (ttt2.c1 > 0)  (cost=0.35 rows=1)
                        -> Table scan on ttt2  (cost=0.35 rows=1)
 |           

create_root_iterator_for_join中最為重要的函數ConnectJoins,裡面負責生成相應的Semijoin/Hashjoin/Antijoin/Nestloopjoin等等的組合的Iterator。因為Hashjoin另有篇幅介紹,這裡舉例來說NestLoopIterator的實作:

/**
  A simple nested loop join, taking in two iterators (left/outer and
  right/inner) and joining them together. This may, of course, scan the inner
  iterator many times. It is currently the only form of join we have.
  The iterator works as a state machine, where the state records whether we need
  to read a new outer row or not, and whether we've seen any rows from the inner
  iterator at all (if not, an outer join need to synthesize a new NULL row).
  The iterator takes care of activating performance schema batch mode on the
  right iterator if needed; this is typically only used if it is the innermost
  table in the entire join (where the gains from turning on batch mode is the
  largest, and the accuracy loss from turning it off are the least critical).
 */
class NestedLoopIterator final : public RowIterator {
  ......
  bool Init() override;
  int Read() override;
  ......
  
 private:  
  ......
  
  unique_ptr_destroy_only<RowIterator> const m_source_outer; ← 外表
  unique_ptr_destroy_only<RowIterator> const m_source_inner; ← 内表
  const JoinType m_join_type;                                ← 連接配接方式
}
bool NestedLoopIterator::Init() {
  if (m_source_outer->Init()) {                              ← 外表初始化
    return true;
  }
  m_state = NEEDS_OUTER_ROW;                                 ← 先掃描外表
......  
  return false;
}
int NestedLoopIterator::Read() {
  if (m_state == END_OF_ROWS) {
    return -1;
  }
  for (;;) {  // Termination condition within loop.
    if (m_state == NEEDS_OUTER_ROW) {                       ← 開始掃描
      int err = m_source_outer->Read();                     ←  掃描外表
      if (err == 1) {
        return 1;  // Error.
      }
      if (err == -1) {
        m_state = END_OF_ROWS;
        return -1;
      }
......
      // Init() could read the NULL row flags (e.g., when building a hash
      // table), so unset them before instead of after.
      m_source_inner->SetNullRowFlag(false);
      if (m_source_inner->Init()) {                         ← 開始内表初始化
        return 1;
      }
      m_state = READING_FIRST_INNER_ROW;                    ← 掃描第一行内表
    }
    DBUG_ASSERT(m_state == READING_INNER_ROWS ||
                m_state == READING_FIRST_INNER_ROW);
    int err = m_source_inner->Read();                       ← 掃描内表
......
    if (err == -1) {
      // Out of inner rows for this outer row. If we are an outer join
      // and never found any inner rows, return a null-complemented row.
      // If not, skip that and go straight to reading a new outer row.
      if ((m_join_type == JoinType::OUTER &&                ← 内表沒有rows
           m_state == READING_FIRST_INNER_ROW) ||
          m_join_type == JoinType::ANTI) {                  ← 内表直接傳回NULL
        m_source_inner->SetNullRowFlag(true);
        m_state = NEEDS_OUTER_ROW;
        return 0;
      } else {
        m_state = NEEDS_OUTER_ROW;                          ← 否則繼續掃描外表
        continue;
      }
    }
    // An inner row has been found.                         ← 内表傳回row
    if (m_join_type == JoinType::ANTI) {
      // Anti-joins should stop scanning the inner side as soon as we see
      // a row, without returning that row.
      m_state = NEEDS_OUTER_ROW;                            ← Anti join隻需要一行
      continue;
    }
    // We have a new row. Semijoins should stop after the first row;
    // regular joins (inner and outer) should go on to scan the rest.
    if (m_join_type == JoinType::SEMI) {
      m_state = NEEDS_OUTER_ROW;                            ← Semi join隻需要一行
    } else {
      m_state = READING_INNER_ROWS;                         ← 否則繼續循環讀内表
    }
    return 0;
  }
}           

最後我們用Hashjoin看下新的執行流程吧:

SELECT_LEX_UNIT::execute()                    ← 執行一個Query Unit
 |-SELECT_LEX_UNIT::ExecuteIteratorQuery
  |-THD_STAGE_INFO()                          ← 設定線程的狀态為executing
  |-query_result->start_execution(thd)        ← 設定為執行狀态,Query result execution_started = true;
  |-query_result->send_result_set_metadata()  ← 先将中繼資料發送給用戶端
  |-set_executed();                           ← Unit executed = true;
  |
  |-m_root_iterator->Init()                   ← 所有Iterator遞歸Init,此處Iterator為HashJoinIterator
  | |-HashJoinIterator::Init()
  | | |-TableScanIterator::Init()
  | | | |-handler::ha_rnd_init()
  |
  | | |-HashJoinIterator::BuildHashTable()      
  | | | |-TableScanIterator::Read()
  | | | | |-handler::ha_rnd_next()
  | | | | | |-ha_innobase::rnd_next()
  |
  | |-HashJoinIterator::InitProbeIterator()
  | | |-TableScanIterator::Init()
  | | | |-handler::ha_rnd_init()
  | | | | |-ha_innobase::rnd_init()
  |
  | ###while循環讀取資料###
  |-m_root_iterator->Read()                   ← 所有Iterator遞歸Read,此處Iterator為HashJoinIterator  
  | |-HashJoinIterator::Read()
  | |-HashJoinIterator::ReadRowFromProbeIterator()
  | | |-TableScanIterator::Read()
  | | | |-handler::ha_rnd_next()
  | | | | |-ha_innobase::rnd_next()
  |
  |-query_result->send_eof()           

參考資料:

https://dev.mysql.com/doc/internals/en/select-structure.html https://dev.mysql.com/doc/internals/en/optimizer-code.html https://jin-yang.github.io/post/mysql-executor.html