Are you using MySQL indexes correctly?

Guide

Starting from a problem we ran into in production, this article looks at the TDDL optimizer, how MySQL selects indexes, and how to choose a table sharding key.

Background

Recently, because the application was being converted to multi-tenancy, we tightened its monitoring. That exposed a problem: the success rate of the application's HSF Consumer was not always 100%, and occasionally dropped to 99.99%.

We traced the problem to one specific slow SQL statement:

(
  select
    `iii_sss_msg`.`id`,
    `iii_sss_msg`.`message_id`,
    `iii_sss_msg`.`title`,
    `iii_sss_msg`.`content`,
    `iii_sss_msg`.`id_seller`,
    `iii_sss_msg`.`id_user`,
    `iii_sss_msg`.`gmt_create`,
    `iii_sss_msg`.`gmt_modified`,
    `iii_sss_msg`.`is_read`,
    `iii_sss_msg`.`category`,
    `iii_sss_msg`.`sub_category`,
    `iii_sss_msg`.`description`,
    `iii_sss_msg`.`need_side_notice`,
    `iii_sss_msg`.`link_url`,
    `iii_sss_msg`.`btn_name`,
    `iii_sss_msg`.`gmt_create_l`,
    `iii_sss_msg`.`mobile_content`,
    `iii_sss_msg`.`tier`,
    `iii_sss_msg`.`requirement_id`,
    `iii_sss_msg`.`fk_template_id`,
    `iii_sss_msg`.`business_part`,
    `iii_sss_msg`.`business_id`
  from
    `iii_sss_msg_29` `iii_sss_msg`
  WHERE
    (
      (
        (
          `iii_sss_msg`.`gmt_create` >= '2023-07-24 00:00:00'
        )
        AND (
          `iii_sss_msg`.`gmt_create` < '2023-07-31 15:46:45.684'
        )
        AND (`iii_sss_msg`.`id_user` = 500173482096)
        AND (`iii_sss_msg`.`tier` IN ('S', 'A'))
        AND (
          `iii_sss_msg`.`sub_category` IN (1000305, 1000306, 1000501, 1000502)
        )
      )
      OR (
        (`iii_sss_msg`.`category` IN (10003, 10005))
        AND (
          `iii_sss_msg`.`gmt_create` >= '2023-07-24 00:00:00'
        )
        AND (
          `iii_sss_msg`.`gmt_create` < '2023-07-31 15:46:45.684'
        )
        AND (`iii_sss_msg`.`id_user` = ***)
        AND (
          `iii_sss_msg`.`sub_category` IN (1000305, 1000306, 1000501, 1000502)
        )
      )
    )
  order by
    `iii_sss_msg`.`gmt_create` desc
  limit
    0, 5
)union all ...

The statement is fairly complex. In general terms, it finds a merchant's five most recent messages from the past seven days that either belong to certain categories or have tier 'S' or 'A'. The iii_sss_msg table is split into 31 physical tables by creation time, with the following routing rule:

<property name="tbRuleArray">
    <value>"iii_sss_msg_" + getCalendar(#gmt_create,1_date,31#).get(Calendar.DAY_OF_MONTH)</value>
</property>

So the query spans seven tables (iii_sss_msg_29, iii_sss_msg_30, ...) whose results are combined with UNION ALL. (This sharding rule is extremely unreasonable, as analyzed later in this article.)
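The routing rule can be illustrated with a small Python sketch. It is not TDDL code: the function name tables_for_window and the example dates are hypothetical, and it assumes the table suffix is the unpadded day of month, as the string concatenation in the rule suggests.

```python
from datetime import date, timedelta

def tables_for_window(start, end, prefix="iii_sss_msg_"):
    """Return the physical tables that hold rows created on start..end (inclusive).

    Mirrors the routing rule: table suffix = day of month of gmt_create.
    """
    names = []
    day = start
    while day <= end:
        name = f"{prefix}{day.day}"
        if name not in names:  # windows longer than a month wrap around
            names.append(name)
        day += timedelta(days=1)
    return names

# A hypothetical window covering seven calendar days touches seven tables.
week = tables_for_window(date(2023, 7, 25), date(2023, 7, 31))
print(len(week), week)
```

Every extra calendar day in the time window adds one more physical table to the query, which is why time-range queries fan out under this rule.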

Let's look at the corresponding MyBatis XML mapping:

<select id="selectByQuery" resultMap="webMsgResultMap" parameterType="map" >
        select
        <include refid="Base_Column_List" />
        from iii_sss_msg
        where
        id_user = #{userId}
        <if test="startTime != null">
             and gmt_create <![CDATA[>=]]> #{startTime}
        </if>
        <if test="endTime != null">
           and gmt_create <![CDATA[<]]> #{endTime}
        </if>
       ...

The order of the WHERE conditions in the MyBatis mapping differs from the order in the SQL statement that was finally executed. What changed the statement?

TDDL optimizer

Our application uses TDDL. After MyBatis generates the SQL, TDDL rewrites it: in the TDDL workflow, the SQL is parsed and optimized at the Matrix layer. (For more detail on TDDL, please consult the relevant material.)

That answers the question: TDDL changed the statement in an attempt to optimize the query. Did it succeed?

Continuing with the analysis, the index information for the table is as follows:

The table has an index named idx_user, a composite index on the four columns gmt_create, id_user, category, and sub_category. All of these columns appear in the WHERE clause of the slow SQL statement above. It looks like TDDL reordered the conditions in order to take advantage of that index.

We analyzed the statement with EXPLAIN, and the result is as follows:

The plan shows that the index actually used is not idx_user (gmt_create, id_user, category, sub_category) but idx_uer_query (id_user, category). TDDL's reordering turned out to be wishful thinking. So why does MySQL choose the idx_uer_query (id_user, category) index?

To investigate, we used FORCE INDEX (idx_user) to force MySQL to use the idx_user (gmt_create, id_user, category, sub_category) index.

Comparing the two execution plans, the plan using idx_uer_query (id_user, category) scans 13948 rows, while the plan using idx_user (gmt_create, id_user, category, sub_category) scans 1552218 rows, more than 100 times as many! The idx_user index was built for nothing, but fortunately MySQL did not follow TDDL's lead. So how does MySQL know that idx_uer_query is the better choice?

How MySQL selects indexes

Choosing an index is the optimizer's job. Its goal is to find the execution plan with the lowest cost. The optimizer weighs the following factors:

1. Conditions in the query statement: MySQL selects the most appropriate index based on the conditions in the query, so that it can locate the matching rows as quickly as possible. If your conditions involve multiple columns, consider a multi-column index so that more of them can be matched inside the index.

2. Index selectivity (cardinality): MySQL will select the most appropriate index based on the selectivity of the index. Selectivity is the ratio of the number of distinct values in an index column to the total number of records in the table. The higher the selectivity, the more efficient the index.

3. Index size and data type: The size and data type of the index also affect the choice of index. Smaller indexes are generally more efficient than larger indexes, and different data types can also affect the efficiency of indexes.

4. Block size: MySQL uses blocks (or pages) to store index data and table data, and the size of the blocks also affects index selection. Smaller data blocks increase the efficiency of caching and reduce the number of disk I/O operations.

5. Index coverage: If a query can be satisfied by a covering index, MySQL will generally prefer it. A covering index contains all the columns the query needs, so there is no need to read rows from the table.
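The selectivity definition in point 2 can be made concrete with a minimal Python sketch. The sample rows and the helper name selectivity are hypothetical; MySQL itself works from sampled statistics rather than an exact count like this.

```python
def selectivity(values):
    """Selectivity = number of distinct values / total number of rows."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)

# Hypothetical sample rows: (id_user, category)
rows = [(1, 10003), (2, 10003), (3, 10005), (4, 10003), (5, 10005), (1, 10005)]

id_user_sel = selectivity(r[0] for r in rows)   # 5 distinct values over 6 rows
category_sel = selectivity(r[1] for r in rows)  # 2 distinct values over 6 rows
print(f"id_user: {id_user_sel:.2f}, category: {category_sel:.2f}")
```

A column whose selectivity is close to 1 narrows the search to very few rows per value, which is why it usually makes a better leading index column than a low-selectivity one.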

Judged by points 1 and 5 alone, idx_user (gmt_create, id_user, category, sub_category) looks like the better choice. But what about point 2?

We ran the following statement to inspect the cardinality of the two indexes:

SHOW INDEX FROM `iii_sss_msg_29`

The cardinality of the gmt_create column of idx_user (gmt_create, id_user, category, sub_category) is 99933, while the cardinality of the id_user column of idx_uer_query (id_user, category) is 286528, which is higher and therefore more selective.

All things considered, the optimizer chose the idx_uer_query (id_user, category) index.

Does MySQL Choose the Wrong Index?

The cardinality difference between the two indexes analyzed above is not very large (and the optimizer's estimates may themselves be inaccurate). Meanwhile, idx_user (gmt_create, id_user, category, sub_category) matches more of the conditions in the query, and using it avoids a sort: the index is already ordered by gmt_create, so if the optimizer picks it, the ORDER BY can be satisfied by simply traversing the index. So even though the number of scanned rows is large, the optimizer might judge the cost to be lower. Could that happen?
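The trade-off between "no sort, but many rows scanned" and "few rows scanned, then a sort" can be sketched in Python. This is only a model with hypothetical data and function names, not MySQL's cost calculation: one function walks an index ordered by gmt_create descending and stops after five matches, the other reads the rows located through an id_user index and sorts them.

```python
import heapq

def top_n_by_ordered_index(index_desc, matches, n):
    """Walk an index already sorted by gmt_create DESC and stop after n matches.

    Returns (result_rows, entries_scanned). No sort step is needed.
    """
    result, scanned = [], 0
    for row in index_desc:
        scanned += 1
        if matches(row):
            result.append(row)
            if len(result) == n:
                break
    return result, scanned

def top_n_by_filter_then_sort(candidate_rows, matches, n):
    """Read rows located via another index, filter them, then sort the survivors."""
    scanned = len(candidate_rows)
    survivors = [r for r in candidate_rows if matches(r)]
    return heapq.nlargest(n, survivors, key=lambda r: r[0]), scanned

# Hypothetical data: rows are (gmt_create, id_user); user 7 owns 1 row in 1000.
rows = [(t, 7 if t % 1000 == 0 else t % 997 + 100) for t in range(1, 100_001)]

def is_user_7(row):
    return row[1] == 7

index_desc = sorted(rows, reverse=True)        # ordered by gmt_create DESC
user_rows = [r for r in rows if r[1] == 7]     # what an id_user index would locate

a, scanned_a = top_n_by_ordered_index(index_desc, is_user_7, 5)
b, scanned_b = top_n_by_filter_then_sort(user_rows, is_user_7, 5)
print(scanned_a, scanned_b)  # 4001 100
```

Both approaches return the same five rows, but when matching rows are sparse in time order, avoiding the sort costs far more scanned entries than the sort would have.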

After executing the statement several times, I found that the optimizer sometimes did use the idx_user index! The TDDL rewrite together with the existence of idx_user (gmt_create, id_user, category, sub_category) probably led the optimizer to choose the wrong index, selecting the execution plan that scans more than a million rows.

With the idx_user (gmt_create, id_user, category, sub_category) index, the statement takes 995 milliseconds:

With the idx_uer_query (id_user, category) index, it takes 95 milliseconds:

This is probably the cause of the slow SQL: misled by the TDDL rewrite, the optimizer chose the wrong index.

How do I build an index?

Let's look again at the idx_user (gmt_create, id_user, category, sub_category) index. Why was it created? Probably because business requirements produce many queries with the conditions above. But is the index reasonable?

The figure above is a B+ tree structure diagram of the idx_user (gmt_create, id_user, category, sub_category) index drawn by the author. The tree as a whole is ordered only by the first column, gmt_create; the other columns are ordered only among entries that share the same values in the columns before them. According to the sharding rule, the data in iii_sss_msg_29 relevant to this query was created between 2023-07-29 00:00:00 and 2023-07-29 23:59:59.

Combined with the where condition of the previous SQL statement:

WHERE
    (
      (
        (
          `iii_sss_msg`.`gmt_create` >= '2023-07-24 00:00:00'
        )
        AND (
          `iii_sss_msg`.`gmt_create` < '2023-07-31 15:46:45.684'
        )

The range condition on gmt_create therefore filters out nothing in this table, so if this index is used, the leaf nodes of idx_user are scanned one after another! When we forced this index earlier, more than 1 million rows were scanned, and the cardinality of the primary key index is a little over 3 million, which is the same order of magnitude.
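The effect of column order on the scanned range can be sketched in Python by modeling a composite index as a sorted list of key tuples. The data and the helper scan_range are hypothetical, and real B+ tree pages are of course not Python lists, but the contiguous range that has to be walked behaves the same way.

```python
from bisect import bisect_left

def scan_range(index, low, high):
    """Count index entries e with low <= e < high in a sorted list of key tuples."""
    return bisect_left(index, high) - bisect_left(index, low)

# Hypothetical rows: (gmt_create, id_user), 100 users with 100 rows each.
rows = [(t, t % 100) for t in range(10_000)]

by_time_user = sorted((t, u) for t, u in rows)  # like (gmt_create, id_user, ...)
by_user_time = sorted((u, t) for t, u in rows)  # like (id_user, ..., gmt_create)

lo, hi, user = 0, 10_000, 42  # the time window covers every row in this table

# Leading column is the range column: the whole window must be walked.
scanned_time_first = scan_range(by_time_user, (lo,), (hi,))
# Leading column is the equality column: only that user's entries are walked.
scanned_user_first = scan_range(by_user_time, (user, lo), (user, hi))
print(scanned_time_first, scanned_user_first)  # 10000 100
```

With the equality column first, the index narrows the scan to one user's entries before the time range is applied, which is the idea behind moving id_user to the front of the index.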

The idx_user (gmt_create, id_user, category, sub_category) index is not reasonable for the query scenario above, especially when the table is sharded by creation time. The author changed it to idx_user (id_user, category, sub_category, gmt_create, is_read) (is_read is also a common condition in business queries, so it was added). After the change, I checked the execution plan of the SQL statement again and found that the optimizer consistently chose the idx_user (id_user, category, sub_category, gmt_create, is_read) index, and this time only a little over 10,000 rows were scanned. Compared with idx_uer_query (id_user, category), this index also reduces the number of lookups back to the table.
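Why a wider index reduces lookups back to the table can be sketched as follows. This is a simplified Python model with hypothetical data and function names; it covers only the category branch of the query and ignores the LIMIT, and it assumes the extra conditions can be checked inside the index before the full row is fetched.

```python
from collections import namedtuple

Row = namedtuple("Row", "id id_user category sub_category gmt_create")

def lookups_with_narrow_index(table, user, categories, sub_categories, lo, hi):
    """Index holds (id_user, category): the other conditions need the full row."""
    lookups, result = 0, []
    for row in table:
        if row.id_user == user and row.category in categories:  # answered by the index
            lookups += 1                                          # fetch full row by primary key
            if row.sub_category in sub_categories and lo <= row.gmt_create < hi:
                result.append(row.id)
    return result, lookups

def lookups_with_wide_index(table, user, categories, sub_categories, lo, hi):
    """Index holds (id_user, category, sub_category, gmt_create): filter first, fetch after."""
    lookups, result = 0, []
    for row in table:
        if (row.id_user == user and row.category in categories
                and row.sub_category in sub_categories
                and lo <= row.gmt_create < hi):                   # all answered by the index
            lookups += 1                                          # fetch only surviving rows
            result.append(row.id)
    return result, lookups

# Hypothetical table: 1000 rows for user 1, alternating categories.
table = [Row(i, 1, 10003 if i % 2 == 0 else 10004, 1000305 + i % 10, i)
         for i in range(1000)]
query = (1, {10003, 10005}, {1000305, 1000306}, 0, 500)

narrow_ids, narrow_lookups = lookups_with_narrow_index(table, *query)
wide_ids, wide_lookups = lookups_with_wide_index(table, *query)
print(narrow_lookups, wide_lookups)  # 500 50
```

Both functions return the same rows; the difference is how many full rows have to be fetched only to be discarded.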

Indexes cannot all be designed up front when a table is first created. They exist to speed up queries and are tightly coupled to business scenarios, so they should be adjusted as the business queries change. A poor index can slow queries down and mislead the optimizer.

How to select a table sharding key?

Is it reasonable to shard the iii_sss_msg table by gmt_create?

The following points should be considered when choosing a sharding key:

1. It should distribute data across the physical tables as evenly as possible, so that the load is balanced. The simplest way to get an even split is to take an auto-increment primary key modulo the number of tables. If the primary key is not auto-incrementing, or is not even numeric, a perfectly even distribution may not be achievable.

2. The sharding key must not be null, and it must be indexed.

3. As far as possible, the WHERE clause of every query should include this field. Where that is not possible, there must be some other way to route the query to a specific physical table.

4. Take care to keep related data together in the same table.

Looking up a merchant's messages in certain categories within the past seven days is the most common scenario in our application. But because the iii_sss_msg table is sharded by gmt_create, these queries constantly fan out across tables: each one has to fetch data from seven tables, and if we later want to show a month of messages, it will have to fetch from thirty. For this workload the sharding key is extremely unreasonable. Given the business scenario, id_user would be a better sharding key. (This change is time-consuming, so it will be handled at a later stage.)
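Routing by id_user can be sketched in Python as follows. This is only an illustration of the idea, not the author's migration plan: the function name table_for_user and the table count of 32 are hypothetical, and it assumes id_user is numeric, as it is in the query above.

```python
from collections import Counter

def table_for_user(id_user, table_count=32, prefix="iii_sss_msg_"):
    """Route a row to a physical table by id_user modulo the table count."""
    return f"{prefix}{id_user % table_count}"

# All of one merchant's messages land in one table, whatever the time window.
print(table_for_user(500173482096))  # iii_sss_msg_16

# Sequential ids spread evenly: 3200 hypothetical users over 32 tables.
distribution = Counter(table_for_user(u) for u in range(3200))
print(len(distribution), set(distribution.values()))  # 32 {100}
```

With this rule a seven-day or thirty-day query for one merchant reads a single physical table, at the price that the distribution is only as even as the id_user values themselves.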

Summary

Starting from a problem we ran into in production, this article looked at the TDDL optimizer, MySQL index selection, and the choice of table sharding key. The main points are:

1. TDDL parses and optimizes SQL at the Matrix layer.

2. Choosing an index is the optimizer's job. Its goal is to find the execution plan with the lowest cost. The optimizer weighs the following factors: the conditions in the query statement, the selectivity (cardinality) of the index, the size and data type of the index, the size of the data block, and index coverage.

3. The MySQL optimizer may select the wrong index. When it misjudges, you can use FORCE INDEX in the application to force a particular index, rewrite the statement to guide the optimizer, or work around the problem by adding or dropping indexes.

4. When choosing a sharding key, consider the following: the data should be distributed across the physical tables as evenly as possible; the sharding key must not be null and must be indexed; as far as possible, the WHERE clause of every query should include the sharding key; and related data should be kept in the same table.

If you want to know more about these topics, please consult the relevant material.

Author: Cheng Cheng

Source-WeChat public account: Alibaba Cloud Developer

Source: https://mp.weixin.qq.com/s/bxqw0Arey3Qia4jjmgmCtQ