Is there a way to optimize this MySQL query (UPDATE, multiple joins)?
I have a query that does what I want on a truncated data set, but when I run it on the full data set (millions of rows) it takes forever to run.
I have two tables: microsat_table and coverage_table.
microsat_table:
+----+----------+-----------+---------+-------------------------------------------------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence |
+----+----------+-----------+---------+-------------------------------------------------+
| 2 | chr2L | 11050 | 11067 | TTTAATTTAATTTAATTT |
| 3 | chr2L | 44173 | 44187 | TATGTATGTATGTAT |
| 5 | chr2L | 54431 | 54477 | ATAATAATATAATATAATATAATATAATATATAATAATATAATAATA |
| 6 | chr2L | 57571 | 57594 | ATATATATATATATATATATATAT |
| 7 | chr2L | 72439 | 72453 | CATACATACATACAT |
| 8 | chr2L | 74028 | 74042 | ATACATACATACATA |
| 9 | chr2L | 85573 | 85587 | ATTTTATTTTATTTT |
| 10 | chr2L | 92429 | 92443 | ACATACATACATACA |
| 11 | chr2L | 138132 | 138166 | TATATAGATATATAAATATATATATATATATATAT |
| 13 | chr2L | 162245 | 162259 | ATACATACATACATA |
+----+----------+-----------+---------+-------------------------------------------------+
coverage_table:
+----------+-------+-------+----------+
| Seq_Name | Start | Stop | Coverage |
+----------+-------+-------+----------+
| chr2L | 5716 | 5771 | 1 |
| chr2L | 8730 | 8824 | 1 |
| chr2L | 9894 | 9948 | 1 |
| chr2L | 19391 | 19491 | 1 |
| chr2L | 19575 | 19675 | 1 |
| chr2L | 19773 | 19776 | 1 |
| chr2L | 19776 | 19872 | 2 |
| chr2L | 21920 | 21959 | 1 |
| chr2L | 21959 | 22020 | 2 |
| chr2L | 22020 | 22059 | 1 |
+----------+-------+-------+----------+
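I want to add a column to microsat_table that holds the average Coverage (from coverage_table) over all coverage_table rows whose Start and Stop values fall within the SSR_Start to SSR_End range of the microsat_table row.
Example result: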
+-----+----------+-----------+---------+--------------------------------+---------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence | avg |
+-----+----------+-----------+---------+--------------------------------+---------+
| 53 | chr2L | 402489 | 402503 | AAAACAAAACAAAAC | 3.0000 |
| 64 | chr2L | 447214 | 447233 | CAGCAGCAGCAGCAGCAGCA | 8.0000 |
| 66 | chr2L | 457839 | 457868 | CAGCAGCAGCAACAGCAGCAGCAGGCAGCA | 2.0000 |
| 105 | chr2L | 579589 | 579603 | TCGAATCGAATCGAA | 11.0000 |
| 123 | chr2L | 628484 | 628501 | TAATGTTAATGTTAATGT | 6.0000 |
+-----+----------+-----------+---------+--------------------------------+---------+
My query is:
UPDATE microsat_table
JOIN
(SELECT m.id, SUM(p.Coverage)/count(p.Start)
AS avg FROM microsat_table m
LEFT OUTER JOIN coverage_table p
ON m.Seq_Name LIKE p.Seq_Name
WHERE m.Seq_Name LIKE p.Seq_Name GROUP BY m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
EXPLAIN results for the truncated tables:
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| 1 | UPDATE | microsat_table_short | NULL | ALL | PRIMARY | NULL | NULL | NULL | 40356 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | testdb.microsat_table_short.id | 1236 | 100.00 | NULL |
| 2 | DERIVED | m | NULL | index | PRIMARY,Sequence,Seq_Name,Motif,SSR_Start,SSR_End | Seq_Name | 53 | NULL | 40356 | 100.00 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | p | NULL | ALL | NULL | NULL | NULL | NULL | 100163 | 1.23 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
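I added indexes (including trying both HASH and BTREE indexes), which sped things up considerably, but I let the query run on the larger data set for 1.5 days and it still had not finished.
Does anyone have suggestions for making it run faster?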
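Thanks

There are a few relatively minor flaws in your code. The biggest problem, however, is that although you say you want to compute "the average coverage (from coverage_table) of all rows whose Start and Stop values in coverage_table fall within the SSR_Start to SSR_End range in microsat_table", your query does not actually seem to be restricted to doing that. Instead, you have only coded a match on Seq_Name.
The code below attempts to address this (I used = and …)

Perhaps updating the table in one big transaction is too much for the system? (How large is the table you are updating?) You could try doing it in chunks. I also opted for a simple sub-select here, which seems easier to read.
Also note Steve Lovell's remark that your query does not seem to care about the Start/Stop columns. Since you probably left that out by accident, I have added it here too; removing it should not be too difficult =)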
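Finally, please add the query plan for the actual tables, since that is the case that is slow.

The result set does not correspond to the data set. Until this suggestion has been addressed (and verified), the original question about indexes cannot be answered.

Thanks!! It cut the query from 40 seconds to 8 seconds on the command line! For some reason, when I run the exact same query from a script, it stretches to 900 seconds :-/. I am trying to figure out what is going on there, but it seems it must be the script, not the query.

Thanks!! Yes, I did forget it by accident :-). Question: I have never used MySQL stored procedures and variables before. Do I need to put all the code into a function, or can I enter it at the command line as-is? When I entered it as-is I got a syntax error, and I also tried putting it into a function, which gave me a syntax error as well. I am surely getting the function declaration wrong, since I have never used them before...

I admit I am used to MSSQL. I seem to have naively but wrongly assumed that this was nearly ANSI SQL and would not be much trouble on MySQL either. Trying to get it to run on sqlfiddle showed me I was badly mistaken... I will try to get it running on MySQL and then adjust my code... (this may take a while =)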
UPDATE microsat_table
JOIN
(
    SELECT
        m.id,
        AVG(p.Coverage) AS avg -- MySQL has its own average function
    FROM
        microsat_table m
        INNER JOIN coverage_table p ON -- Change to INNER JOIN, your old WHERE clause had this effect anyway
            m.Seq_Name = p.Seq_Name -- Use '=' not 'LIKE' when looking for an exact match
    WHERE
        p.Start >= m.SSR_Start -- This WHERE clause is the most important change
        AND p.Stop <= m.SSR_End -- You omitted it in your version (the coverage_table column is Stop, not End)
    GROUP BY
        m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
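
The EXPLAIN output in the question shows coverage_table (alias p) being read with type ALL and no possible_keys, so every microsat row is compared against the whole coverage table. A composite index that leads with the equality column and then covers the range columns may help here. The following is only a sketch: the index name idx_cov_seq_start_stop is made up for illustration, and whether the optimizer actually picks the index depends on the MySQL version and the data distribution, so it is worth checking with EXPLAIN on the full tables.

```sql
-- Hypothetical index name; the columns come from coverage_table in the question.
-- Seq_Name first (equality match), then Start and Stop (range conditions),
-- with Coverage included so the join can be answered from the index alone.
CREATE INDEX idx_cov_seq_start_stop
    ON coverage_table (Seq_Name, Start, Stop, Coverage);

-- Check the plan of the inner SELECT before running the full UPDATE.
EXPLAIN
SELECT m.id, AVG(p.Coverage) AS avg
FROM microsat_table m
INNER JOIN coverage_table p
    ON m.Seq_Name = p.Seq_Name
WHERE p.Start >= m.SSR_Start
  AND p.Stop <= m.SSR_End
GROUP BY m.id;
```

If the plan row for p still shows type ALL after the index is created, the index is not being used and a different column order or query shape would have to be tried.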
-- Note: this is T-SQL (MSSQL) syntax and does not run on MySQL as written
DECLARE @min_id int,
        @max_id int,
        @blocksize int

SELECT @min_id = MIN(id),
       @max_id = MAX(id),
       @blocksize = 100000 -- adapt as needed
FROM microsat_table

WHILE @min_id <= @max_id
BEGIN
    UPDATE microsat_table
    SET microsat_table.avg = (SELECT SUM(p.Coverage)/COUNT(p.Start) AS avg
                              FROM microsat_table m
                              LEFT OUTER JOIN coverage_table p
                                ON m.Seq_Name LIKE p.Seq_Name -- if possible use '=' here instead of LIKE
                               AND p.Start >= m.SSR_Start -- flagrantly "stolen" from Steve Lovell's answer
                               AND p.Stop <= m.SSR_End -- the coverage_table column is Stop, not End
                              WHERE m.id = microsat_table.id)
    -- limit update to this block:
    WHERE microsat_table.id BETWEEN @min_id AND (@min_id + @blocksize - 1)

    -- prepare for next block
    SELECT @min_id = @min_id + @blocksize
END
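
Since the block above uses MSSQL syntax, here is a rough sketch of how the same chunked update might look as a MySQL stored procedure. This is an untested adaptation, not the answerer's code: the procedure name update_avg_in_blocks, the parameter p_blocksize and the local variable names are invented for illustration. It assumes the avg column already exists on microsat_table and that autocommit is enabled, so that each chunk is committed separately. The sub-select reads only coverage_table, because MySQL does not allow the table being updated to appear in the FROM clause of its own sub-select.

```sql
-- DELIMITER is a command of the mysql command-line client, not SQL itself;
-- it lets the procedure body contain ';' without ending the statement early.
DELIMITER //

CREATE PROCEDURE update_avg_in_blocks(IN p_blocksize INT)
BEGIN
    DECLARE v_min_id INT;
    DECLARE v_max_id INT;

    SELECT MIN(id), MAX(id) INTO v_min_id, v_max_id
    FROM microsat_table;

    WHILE v_min_id <= v_max_id DO
        -- Update one block of ids; the correlated sub-select computes the
        -- average coverage of the intervals contained in each microsat row.
        UPDATE microsat_table AS m
        SET m.avg = (SELECT AVG(p.Coverage)
                     FROM coverage_table AS p
                     WHERE p.Seq_Name = m.Seq_Name
                       AND p.Start >= m.SSR_Start
                       AND p.Stop <= m.SSR_End)
        WHERE m.id BETWEEN v_min_id AND v_min_id + p_blocksize - 1;

        -- Prepare for the next block.
        SET v_min_id = v_min_id + p_blocksize;
    END WHILE;
END //

DELIMITER ;

-- Block size of 100000 as in the answer above; adapt as needed.
CALL update_avg_in_blocks(100000);
```

Entered this way in the mysql client, the whole thing can be pasted at the command line. Note one behavioral difference from the INNER JOIN version: rows with no contained coverage interval get avg set to NULL here instead of being left untouched.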