Using MySQL 5.7 Generated (Virtual) Columns to Improve Query Performance

In this blog post, we’ll look at ways you can use MySQL 5.7 generated columns (or virtual columns) to improve query performance.

Overview

About two years ago I published a blog post about Generated (Virtual) Columns in MySQL 5.7. Since then, it’s been one of my favorite features in the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly increase query performance. I’m going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.

The Problem

Recently I was working with a customer who was struggling with this query:
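The query itself was shown as an image in the original post and is missing from this copy. Based on the discussion that follows (a date-range filter on the `ts` timestamp field and a `GROUP BY` over a concatenation of `verb` and `url`), it would have looked roughly like this sketch; the table name `api_log`, the date range, and the aggregate list are assumptions:

```sql
-- Hypothetical reconstruction of the slow reporting query.
-- Table name, dates, and aggregates are illustrative only.
SELECT
    CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS api,
    COUNT(*) AS calls
FROM api_log
WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, ' - ', REPLACE(url, '.xml', ''));
```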

The query was running for more than an hour and used all space in the tmp directory (with sort files).

The table structure is as follows:
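The table definition was also shown as an image in the original. A minimal sketch of the columns relevant to the discussion follows; the table name, `id`, and the `verb` length are assumptions, while the `url` definition and the key on `ts` come from the text:

```sql
-- Sketch: only the columns the post discusses are shown.
CREATE TABLE api_log (
    id   INT NOT NULL AUTO_INCREMENT,
    ts   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    verb VARCHAR(16) NOT NULL,
    url  VARCHAR(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
    PRIMARY KEY (id),
    KEY idx_ts (ts)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```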

We found out the query was not using an index on the timestamp field (“ts”):

The reason for that is simple: the number of rows matching the filter condition was too large for an index scan to be efficient (or at least the optimizer thinks that):

Total number of rows: 21998514. The query needs to scan 36% of the total rows (7948800 / 21998514). (Translator's note: once the estimated scan exceeds roughly 20–30% of the table, the optimizer usually falls back to a full table scan even when an index exists.)

In this case, we have a number of approaches:

Create a combined index on timestamp column + group by fields

Create a covered index (including fields that are selected)

Create an index on just GROUP BY fields

Create an index for loose index scan
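Sketched as DDL (table and index names are assumptions), the four options would look like this. Note that under utf8mb4 only a prefix of the varchar(3000) `url` column can be indexed:

```sql
-- Hypothetical DDL for each option; none of them works here,
-- as explained next.
ALTER TABLE api_log ADD KEY idx_combined (ts, verb, url(255));  -- 1. timestamp + GROUP BY fields
ALTER TABLE api_log ADD KEY idx_covering (ts, verb, url(255) /* + selected columns */);  -- 2. covering
ALTER TABLE api_log ADD KEY idx_group (verb, url(255));         -- 3. GROUP BY fields only
-- 4. A loose index scan would also rely on an index leading with the
--    GROUP BY fields, such as idx_group above.
```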

However, if we look closer at the “GROUP BY” part of the query, we quickly realize that none of those solutions will work. Here is our GROUP BY part:

There are two problems here:

It is using a calculated field, so MySQL can't just scan an index on verb + url. It first needs to concat the two fields and then group on the concatenated string. That means the index won't be used.

The URL is declared as “varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL” and can’t be indexed in full (even with innodb_large_prefix=1 option, which is the default as we have utf8 enabled). We can only do a partial index, which won’t be helpful for GROUP BY optimization.

Here I tried to add a full index on the URL column, with innodb_large_prefix=1:
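The statement and its error were shown as an image in the original. A minimal reproduction (table name assumed) would fail like this, because with utf8mb4 (up to 4 bytes per character) a full index on a varchar(3000) column needs up to 12000 bytes, far beyond InnoDB's 3072-byte key limit:

```sql
-- Fails even with innodb_large_prefix=1:
ALTER TABLE api_log ADD KEY idx_url (url);
-- ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```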

Well, changing the "GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))" to "GROUP BY verb, url" could help (assuming that we somehow trim the field definition from varchar(3000) to something smaller, which may or may not be possible). However, it will change the results as it will not remove the .xml extension from the URL field.

The Solution

The good news is that in MySQL 5.7 we have virtual columns. So we can create a virtual column on top of "CONCAT(verb, ' - ', replace(url,'.xml',''))". The best part: we do not have to perform a GROUP BY with the full string (potentially > 3000 bytes). We can use an MD5 hash (or a longer hash, e.g., sha1/sha2) for the purposes of the GROUP BY.

Here is the solution:
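The DDL was shown as an image in the original; the following is reconstructed from the three steps enumerated below (the table name `api_log` is an assumption; the expression, the varbinary(16) type, and the MD5/unhex hashing come from the text):

```sql
-- Virtual column: 16-byte binary MD5 of the concatenated expression.
ALTER TABLE api_log
    ADD verb_url_hash VARBINARY(16)
        GENERATED ALWAYS AS
            (UNHEX(MD5(CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))))) VIRTUAL;

-- Index the virtual column (the index materializes the hash values).
ALTER TABLE api_log ADD KEY idx_verb_url_hash (verb_url_hash);
```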

So what we did here is:

Declared the virtual column with type varbinary(16)

Created a virtual column on CONCAT(verb, ' - ', replace(url,'.xml','')), and used an MD5 hash on top, plus an unhex to convert 32 hex bytes to 16 binary bytes

Created an index on top of the virtual column

Now we can change the query and GROUP BY verb_url_hash column:
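The rewritten query was shown as an image in the original; under the assumed table name `api_log` and an illustrative date range, it might look like:

```sql
-- Group on the 16-byte hash instead of the 3000+-byte expression;
-- the index on verb_url_hash can now drive the GROUP BY.
SELECT
    CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS api,
    COUNT(*) AS calls
FROM api_log
WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
GROUP BY verb_url_hash;
```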

MySQL 5.7 enables strict SQL mode by default; we can change it for this query only.
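Selecting CONCAT(...) while grouping by verb_url_hash would be rejected by the ONLY_FULL_GROUP_BY SQL mode, which is on by default in 5.7, so the mode can be relaxed for the current session only; a minimal sketch:

```sql
-- Session-level change only; other connections keep the default mode.
SET SESSION sql_mode = '';
```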

Now the explain plan looks much better:

Covered Index

Now we can attempt to do a covered index, which will be quite large:
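The DDL was shown as an image in the original. A hypothetical sketch (table and index names assumed): after dropping COLLATE utf8mb4_unicode_ci from url, as described next, the full 3000-byte column fits under InnoDB's 3072-byte key limit and the index can include every column the query touches:

```sql
-- Covering index: hash for the GROUP BY, ts for the WHERE filter,
-- plus verb and url so the query is answered from the index alone.
ALTER TABLE api_log
    ADD KEY idx_covered (verb_url_hash, ts, verb, url);
```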

We had to add a “verb” and “url”, so beforehand I had to remove the COLLATE utf8mb4_unicode_ci from the table definition. Now explain shows that we’re using the index:

The response time dropped to ~12 seconds! However, the index size is significantly larger compared to just verb_url_hash (16 bytes per record).

Conclusion

MySQL 5.7 generated columns provide a valuable way to improve query performance. If you have an interesting case, please share in the comments.

Original publication date: 2018-02-25

Author: 星耀隊@知數堂