In this blog post, we'll look at ways you can use MySQL 5.7 generated columns (or virtual columns) to improve query performance.
Overview
About two years ago I published a blog post about Generated (Virtual) Columns in MySQL 5.7. Since then, it's been one of my favorite features in the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly increase query performance. I'm going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.
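As a quick refresher, a generated column is defined by an expression over other columns, and in 5.7 you can index it like any regular column. A minimal sketch (the table `t` and its columns are made up for illustration):

```sql
-- The value of a_plus_b is computed from the expression; the index on it
-- is an ordinary B-tree that the optimizer can use for lookups and grouping.
CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL,
  a_plus_b INT GENERATED ALWAYS AS (a + b) VIRTUAL,
  KEY k_a_plus_b (a_plus_b)
);
```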
The Problem
Recently I was working with a customer who was struggling with this query:

```sql
SELECT
    CONCAT(verb, '-', replace(url,'.xml','')) AS 'API Call',
    COUNT(*) as 'No. of API Calls',
    AVG(ExecutionTime) as 'Avg. Execution Time',
    COUNT(distinct AccountId) as 'No. Of Accounts',
    COUNT(distinct ParentAccountId) as 'No. Of Parents'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, '-', replace(url,'.xml',''))
HAVING COUNT(*) >= 1;
```
The query was running for more than an hour and used up all the space in the tmp directory (with sort files).
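A quick way to confirm that a query is spilling sorts and temporary tables to disk is to compare the server's counters before and after running it (these are standard MySQL status variables):

```sql
-- On-disk temporary tables and multi-pass sorts both point at tmpdir pressure:
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';
```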
The table looked like this:

```sql
CREATE TABLE `ApiLog` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `ts` timestamp DEFAULT CURRENT_TIMESTAMP,
  `ServerName` varchar(50) NOT NULL default '',
  `ServerIP` varchar(50) NOT NULL default '',
  `ClientIP` varchar(50) NOT NULL default '',
  `ExecutionTime` int(11) NOT NULL default 0,
  `URL` varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
  `Verb` varchar(16) NOT NULL,
  `AccountId` int(11) NOT NULL,
  `ParentAccountId` int(11) NOT NULL,
  `QueryString` varchar(3000) NOT NULL,
  `Request` text NOT NULL,
  `RequestHeaders` varchar(2000) NOT NULL,
  `Response` text NOT NULL,
  `ResponseHeaders` varchar(2000) NOT NULL,
  `ResponseCode` varchar(4000) NOT NULL,
  -- ... other fields removed for simplicity
  PRIMARY KEY (`Id`),
  KEY `ts` (`ts`),
  -- ... other indexes removed for simplicity
) ENGINE=InnoDB;
```
We found out the query was not using an index on the timestamp field (ts):

```sql
mysql> explain SELECT CONCAT(verb, '-', replace(url,'.xml','')) AS 'API Call', COUNT(*) as 'No. of API Calls', avg(ExecutionTime) as 'Avg. Execution Time', count(distinct AccountId) as 'No. Of Accounts', count(distinct ParentAccountId) as 'No. Of Parents' FROM ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59' GROUP BY CONCAT(verb, '-', replace(url,'.xml','')) HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: ALL
possible_keys: ts
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 22255292
     filtered: 50.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
```
The reason for that is simple: the number of rows matching the filter condition was too large for an index scan to be efficient (or at least the optimizer thinks so):
```sql
mysql> select count(*) from ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59';
+----------+
| count(*) |
+----------+
|  7948800 |
+----------+
1 row in set (2.68 sec)
```
Total number of rows: 21998514. The query needs to scan 36% of the total rows (7948800 / 21998514). (Translator's note: once the estimated number of matching rows exceeds roughly 20-30% of the table, the optimizer will usually choose a full table scan even though an index exists.)
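You can see the trade-off the optimizer is making by forcing the index (using the `ts` index name from the table definition above):

```sql
-- FORCE INDEX makes the optimizer use the range scan it rejected;
-- at ~36% selectivity the random lookups back to the table usually
-- make this slower, not faster, than the full scan it chose on its own.
EXPLAIN SELECT CONCAT(verb, '-', replace(url,'.xml','')) AS 'API Call', COUNT(*)
FROM ApiLog FORCE INDEX (ts)
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, '-', replace(url,'.xml',''));
```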
In this case, we have a number of approaches:
- Create a combined index on the timestamp column + GROUP BY fields (sketched below)
- Create a covered index (including all the fields that are selected)
- Create an index on just the GROUP BY fields
- Create an index for a loose index scan
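For illustration, the first approach would look something like this (the index name is made up; as we are about to see, none of these can work for this query):

```sql
-- Combined index: the filter column first, then the GROUP BY columns.
-- This fails here, both because the GROUP BY uses an expression and
-- because url is too long to index in full:
ALTER TABLE ApiLog ADD KEY ts_verb_url (ts, verb, url);
```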
However, if we look closer at the GROUP BY part of the query, we quickly realize that none of those solutions will work. Here is our GROUP BY part:

```sql
GROUP BY CONCAT(verb, '-', replace(url,'.xml',''))
```
There are two problems here:
- It is using a calculated field, so MySQL can't just scan an index on verb + url. It needs to first concatenate the two fields and then group on the concatenated string. That means the index won't be used.
- The url is declared as varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL and can't be indexed in full (even with the innodb_large_prefix=1 option, which is the default as we have utf8mb4 enabled). We can only do a partial (prefix) index, which won't be helpful for GROUP BY optimization (see the sketch below).
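A prefix index is the only thing MySQL will accept on a column this wide (the 255-character prefix below is an arbitrary example):

```sql
-- A prefix index covers only the first N characters, so it can help
-- WHERE url LIKE 'something%' lookups, but GROUP BY still has to read
-- the full values and sort them itself:
ALTER TABLE ApiLog ADD KEY url_prefix (url(255));
```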
Here, I'm trying to add a full index on the url with innodb_large_prefix=1:

```sql
mysql> alter table ApiLog add key verb_url(verb, url);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```

This is expected: with utf8mb4, each character can take up to 4 bytes, so varchar(3000) may need up to 12000 bytes of key space, well over the 3072-byte limit.
Well, changing the GROUP BY CONCAT(verb, '-', replace(url,'.xml','')) to GROUP BY verb, url could help (assuming that we somehow trim the field definition from varchar(3000) to something smaller, which may or may not be possible). However, it would change the results, as it would no longer remove the .xml extension from the url field.
The Solution
The good news is that in MySQL 5.7 we have virtual columns, so we can create a virtual column on top of CONCAT(verb, '-', replace(url,'.xml','')). The best part: we do not have to perform a GROUP BY with the full string (potentially longer than 3000 bytes). We can use an MD5 hash (or a longer hash, e.g. SHA1/SHA2) for the purposes of the GROUP BY.
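A quick sanity check of the expression we are about to index ('GET' and '/api/users.xml' are made-up sample values):

```sql
-- MD5 returns 32 hex characters; UNHEX packs them into 16 binary bytes,
-- a fixed-size value that is cheap to index and group on:
SELECT LENGTH(UNHEX(MD5(CONCAT('GET', '-', REPLACE('/api/users.xml', '.xml', ''))))) AS hash_len;
-- hash_len: 16
```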
Here is the solution:

```sql
alter table ApiLog add verb_url_hash varbinary(16) GENERATED ALWAYS AS (unhex(md5(CONCAT(verb, '-', replace(url,'.xml',''))))) VIRTUAL;
alter table ApiLog add key (verb_url_hash);
```
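Because the column is VIRTUAL, it takes no storage in the row itself; only the index materializes the hash. A quick spot check that the column matches the original expression (the LIMIT value is arbitrary):

```sql
-- The hex of the virtual column should match MD5() of the concatenated
-- string (up to letter case):
SELECT CONCAT(verb, '-', replace(url,'.xml','')) AS api_call,
       MD5(CONCAT(verb, '-', replace(url,'.xml',''))) AS md5_hex,
       HEX(verb_url_hash) AS column_hex
FROM ApiLog LIMIT 3;
```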
So what we did here is:
- Declared the virtual column with type varbinary(16)
- Created the virtual column on CONCAT(verb, '-', replace(url,'.xml','')), and used an MD5 hash on top, plus UNHEX to convert the 32 hex characters to 16 binary bytes
- Created an index on top of the virtual column
Now we can change the query to GROUP BY the verb_url_hash column:

```sql
mysql> explain SELECT CONCAT(verb, '-', replace(url,'.xml',''))
       AS 'API Call', COUNT(*) as 'No. of API Calls',
       avg(ExecutionTime) as 'Avg. Execution Time',
       count(distinct AccountId) as 'No. Of Accounts',
       count(distinct ParentAccountId) as 'No. Of Parents'
       FROM ApiLog
       WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
       GROUP BY verb_url_hash
       HAVING COUNT(*) >= 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in
GROUP BY clause and contains nonaggregated column 'ApiLog.ApiLog.Verb'
which is not functionally dependent on columns in GROUP BY clause;
this is incompatible with sql_mode=only_full_group_by
```
MySQL 5.7 has a strict mode enabled by default, which we can change for that query only:

```sql
mysql> select @@sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> set sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.00 sec)
```

Now the explain plan looks much better:
```sql
mysql> explain SELECT CONCAT(verb, '-', replace(url,'.xml','')) AS 'API Call', COUNT(*) as 'No. of API Calls', avg(ExecutionTime) as 'Avg. Execution Time', count(distinct AccountId) as 'No. Of Accounts', count(distinct ParentAccountId) as 'No. Of Parents' FROM ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59' GROUP BY verb_url_hash HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash
          key: verb_url_hash
      key_len: 19
          ref: NULL
         rows: 22008891
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)
```
MySQL will avoid any sorting, which is much faster. It will still have to eventually scan the whole table in the order of the index, but the response time is significantly better: ~38 seconds as opposed to more than an hour.
Covered Index
Now we can attempt to create a covered index, which will be quite large:

```sql
mysql> alter table ApiLog add key covered_index (`verb_url_hash`,`ts`,`ExecutionTime`,`AccountId`,`ParentAccountId`, verb, url);
Query OK, 0 rows affected (1 min 29.71 sec)
Records: 0  Duplicates: 0  Warnings: 0
```
We had to add verb and url, so beforehand I had to remove the COLLATE utf8mb4_unicode_ci from the table definition; otherwise the key would again exceed the 3072-byte limit.
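One way to do that (a sketch only; it assumes a single-byte character set is acceptable for the url data, which depends on the application):

```sql
-- With a 1-byte-per-character charset, varchar(3000) needs at most
-- ~3002 bytes of key space, which fits under InnoDB's 3072-byte limit:
ALTER TABLE ApiLog MODIFY `URL` varchar(3000) CHARACTER SET latin1 NOT NULL;
```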
Now explain shows that we're using the index:

```sql
mysql> explain SELECT CONCAT(verb, '-', replace(url,'.xml','')) AS 'API Call', COUNT(*) as 'No. of API Calls', AVG(ExecutionTime) as 'Avg. Execution Time', COUNT(distinct AccountId) as 'No. Of Accounts', COUNT(distinct ParentAccountId) as 'No. Of Parents' FROM ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59' GROUP BY verb_url_hash HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash,covered_index
          key: covered_index
      key_len: 3057
          ref: NULL
         rows: 22382136
     filtered: 50.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)
```
The response time dropped to ~12 seconds! However, the index size is significantly larger compared to just verb_url_hash (16 bytes per record).
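If you want to quantify that trade-off, InnoDB persistent statistics (enabled by default in 5.7) expose per-index sizes:

```sql
-- Per-index size in pages, from the standard innodb_index_stats table;
-- multiply by the page size to get bytes:
SELECT index_name,
       stat_value * @@innodb_page_size AS size_bytes
FROM mysql.innodb_index_stats
WHERE table_name = 'ApiLog'
  AND stat_name = 'size';
```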
Conclusion
MySQL 5.7 generated columns provide a valuable way to improve query performance. If you have an interesting case, please share it in the comments.
Source: https://yq.aliyun.com/articles/495586