MySQL: problem with a slow correlated subquery
I have a query that I am running in MySQL. As you can see, every part of the query is on an indexed field. Even so, the query takes a very long time (tens of minutes, longer than I am willing to wait). The connector tables each consist of two integers and two indexes (one on field one, field two; the other on field two, field one). Source and target are tables with a single indexed integer field. Given all the indexes, I would expect this query to finish within a few seconds. Any suggestions on 1: why it takes so long, and 2: how to make it faster? Thanks
mysql> explain
SELECT DISTINCT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID
WHERE geneConnect.geneSymbolID NOT IN (
SELECT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels3 AS target ON target.id = snpEConnect.indelID);
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| 1 | PRIMARY | source | index | id | id | 4 | NULL | 5771 | Using index; Using temporary |
| 1 | PRIMARY | snpEConnect | ref | snpEList | snpEList | 4 | treattablebrowser.source.id | 2 | Using index |
| 1 | PRIMARY | geneConnect | ref | snpEList | snpEList | 4 | treattablebrowser.snpEConnect.snpEffectID | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | geneConnect | ref | snpEList,geneList | geneList | 4 | func | 1 | Using index |
| 2 | DEPENDENT SUBQUERY | target | index | id | id | 4 | NULL | 6297 | Using index; Using join buffer |
| 2 | DEPENDENT SUBQUERY | snpEConnect | ref | snpEList | snpEList | 8 | treattablebrowser.target.id,treattablebrowser.geneConnect.snpEffectID | 1 | Using index |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
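6 rows in set (0.01 sec)
I suppose this is largely of academic interest now that Greg has solved the problem himself. It is good to know that my intuition about these things can break down completely. I can still rewrite it three ways. I thought the first could be simplified, but as Greg pointed out, the simplification does not work. It did produce a different query plan in my SQL Server tests, though I am not sure it would be any faster than the original.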
Select Distinct
g1.geneSymbolID
From
SNPEffectGeneConnector AS g1
Inner Join
IndelSNPEffectConnector AS s1
ON g1.snpEffectID = s1.snpEffectID
Inner Join
InDels2 AS i2 ON i2.id = s1.indelID
Where Not Exists (
Select 'x'
From
SNPEffectGeneConnector As g2
Inner Join
IndelSNPEffectConnector AS s2
On g2.snpEffectID = s2.snpEffectID
Inner Join
InDels3 As i3
On i3.id = s2.indelID
Where
g2.geneSymbolID = g1.geneSymbolID
);
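Before comparing speed, it may be worth checking that the NOT EXISTS form returns the same genes as the original NOT IN form. Here is a minimal sketch of that check, assuming an in-memory SQLite database as a stand-in for MySQL (so it says nothing about MySQL's plan or timing) and using the small test data set posted in this thread. The Python names `OUTER`, `INNER` and `genes` are mine, for illustration only.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption: only the
# result sets are compared here, not MySQL's query plan or run time).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE InDels2 (id INTEGER);
CREATE TABLE InDels3 (id INTEGER);
CREATE TABLE IndelSNPEffectConnector (indelID INTEGER, snpEffectID INTEGER);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INTEGER, snpEffectID INTEGER);
INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88), (3, 77);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88), (42, 55), (42, 77);
""")

# Outer part shared by both forms: genes reachable from InDels2.
OUTER = """
SELECT DISTINCT g1.geneSymbolID
FROM SNPEffectGeneConnector AS g1
JOIN IndelSNPEffectConnector AS s1 ON g1.snpEffectID = s1.snpEffectID
JOIN InDels2 AS i2 ON i2.id = s1.indelID
"""

# Inner part shared by both forms: genes reachable from InDels3.
INNER = """
FROM SNPEffectGeneConnector AS g2
JOIN IndelSNPEffectConnector AS s2 ON g2.snpEffectID = s2.snpEffectID
JOIN InDels3 AS i3 ON i3.id = s2.indelID
"""

not_in_sql = (OUTER + "WHERE g1.geneSymbolID NOT IN (SELECT g2.geneSymbolID"
              + INNER + ")")
not_exists_sql = (OUTER + "WHERE NOT EXISTS (SELECT 1"
                  + INNER + "WHERE g2.geneSymbolID = g1.geneSymbolID)")


def genes(sql):
    """Run a query and return the gene IDs as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


not_in_genes = genes(not_in_sql)
not_exists_genes = genes(not_exists_sql)
print(not_in_genes, not_exists_genes)  # prints [42] [42]
```

On this data both forms agree. Whether MySQL actually executes the NOT EXISTS form faster is something only EXPLAIN and a timed run on the real tables can tell.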
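I am not 100% sure about the second approach, but it works on my small amount of test data. If it is correct, it has a shorter query plan (not necessarily faster, but a good sign).
Another approach (apologies for the non-descriptive aliases).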
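The query above should be equivalent to yours and give better performance. If I understand correctly, you want to find all geneSymbolID values in SNPEffectGeneConnector that have an entry in IndelSNPEffectConnector whose indelID is in InDels2, but that have no corresponding match through the same indelID in InDels3.
You could then run the first part of the query (the "do" part) and further join the last part, which collects all the genes that match. A LEFT JOIN against the gene symbol table that requires the match to fail will then yield all genes that do not meet the inverse criterion, and are therefore of interest.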
Revised answer
This query is meant to match the conditions described above.
For this query, I think you need the following indexes:
CREATE INDEX SNPEffectGeneConnector_ndx
ON SNPEffectGeneConnector(snpEffectID, geneSymbolID);
CREATE INDEX SNPEffectGeneConnector_ndx2
ON SNPEffectGeneConnector(geneSymbolID);
CREATE INDEX IndelSNPEffectConnector_ndx
ON IndelSNPEffectConnector(snpEffectID, indelID);
CREATE INDEX InDels2_ndx ON InDels2(id); -- use CREATE UNIQUE INDEX if ids are unique; not needed if id is the primary key
CREATE INDEX InDels3_ndx ON InDels3(id); -- use CREATE UNIQUE INDEX if ids are unique; not needed if id is the primary key
To get the genes of interest:
SELECT glob.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS glob
LEFT JOIN (
SELECT DISTINCT genes.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
) AS fits ON (glob.geneSymbolID = fits.geneSymbolID)
WHERE fits.geneSymbolID IS NULL;
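Test
So, since gene 100 is connected to effect 55, and 55 is connected to indel 1, it is picked up through InDels2. But it is also connected to effect 88, and 88 is connected to indel 2, which is in InDels3, so it must not appear.
What will appear? If I have understood the requirements, we need a gene that causes an effect whose indel is not listed in InDels3. So, for example, gene 42, which causes effect 77, which is associated with indel 3, which is not present in InDels3, must appear.
With gene 42 added to the test data, the query yields: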
+--------------+
| geneSymbolID |
+--------------+
| 42 |
+--------------+
A modified version of the first query can be used to check why 42 gets through and 100 does not:
SELECT genes.geneSymbolID, effectSource.snpEffectID, effectTarget.snpEffectID, indelSource.indelId AS sourceInDel, indelTarget.indelId AS targetInDel, InDels3.id
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
LEFT JOIN InDels3 ON ( indelTarget.indelId = InDels3.id );
+--------------+-------------+-------------+-------------+-------------+------+
| geneSymbolID | snpEffectID | snpEffectID | sourceInDel | targetInDel | id |
+--------------+-------------+-------------+-------------+-------------+------+
| 42 | 55 | 55 | 1 | 1 | NULL |
| 42 | 55 | 77 | 1 | 3 | NULL |
| 100 | 55 | 55 | 1 | 1 | NULL |
| 100 | 55 | 88 | 1 | 2 | 2 |
+--------------+-------------+-------------+-------------+-------------+------+
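…100 has a row whose InDels3 id is not NULL, and it reports target indel 2.
It turns out the problem was that, although everything is indexed, the gene IDs returned by the subquery are not indexed. Joining against, or running an IN search against, an unindexed set of numbers performs very badly, and that is what I was getting.
My solution was to run the outer join and the inner join separately, dump the results into two different indexed tables, and then delete from table 1 the gene IDs that are also in table 2.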
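The two-table approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it runs against an in-memory SQLite database rather than MySQL, it uses the small test data set from this thread, and the table names `sourceGenes` and `targetGenes` are hypothetical (the original table names were not posted). The PRIMARY KEY on each helper table is what provides the index on the gene IDs.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE InDels2 (id INTEGER);
CREATE TABLE InDels3 (id INTEGER);
CREATE TABLE IndelSNPEffectConnector (indelID INTEGER, snpEffectID INTEGER);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INTEGER, snpEffectID INTEGER);
INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88), (3, 77);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88), (42, 55), (42, 77);

-- Hypothetical helper tables; the primary key gives each one an index.
CREATE TABLE sourceGenes (geneSymbolID INTEGER PRIMARY KEY);
CREATE TABLE targetGenes (geneSymbolID INTEGER PRIMARY KEY);

-- Step 1: genes reachable from InDels2 (the outer query of the question).
INSERT INTO sourceGenes
SELECT DISTINCT g.geneSymbolID
FROM SNPEffectGeneConnector AS g
JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
JOIN InDels2 AS i ON i.id = s.indelID;

-- Step 2: genes reachable from InDels3 (the subquery of the question).
INSERT INTO targetGenes
SELECT DISTINCT g.geneSymbolID
FROM SNPEffectGeneConnector AS g
JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
JOIN InDels3 AS i ON i.id = s.indelID;

-- Step 3: remove from the first table every gene that is also in the second.
DELETE FROM sourceGenes
WHERE EXISTS (SELECT 1 FROM targetGenes AS t
              WHERE t.geneSymbolID = sourceGenes.geneSymbolID);
""")

remaining_genes = [row[0] for row in conn.execute(
    "SELECT geneSymbolID FROM sourceGenes ORDER BY geneSymbolID")]
print(remaining_genes)  # prints [42]
```

In MySQL the same three steps would be ordinary CREATE TABLE, INSERT ... SELECT and DELETE statements. The point is that the final step looks up gene IDs in an indexed table instead of in an unindexed subquery result.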
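The moral of the story: never JOIN against, or use IN on, any set that is not indexed.
Try using another JOIN instead of the IN() clause. IN is very slow.
It is not really that IN is slow in itself; it is that it runs against a pool of items with no index. IN against an indexed table is very fast.
I can't do that, because I cannot match on indel ID; I have to match on gene ID.
Please provide some test data that produces different results across the queries.
I would expect the following data to let a value from InDels2 through when it actually should not get through. InDels2 (source): 1; InDels3 (target): 2; IndelSNPEffectConnector (snpEConnect): 1:55; 2:88; SNPEffectGeneConnector: 55:100; 88:100;
@GregDougherty Thanks for that. I have updated the answer with different queries and a link that runs them against your test data. This was a good lesson for me. However, if you change LEFT OUTER JOIN InDels2 to INNER JOIN InDels2, the second suggestion no longer needs the check for count(i2.id) > 0.
The indexes already exist. What I am looking for is different indel IDs that share the same gene ID. So this does not work.
Could you provide, say, three rows from each of the four tables (just the IDs) and an example of the desired output for those rows? I am fairly confident something can be worked out.
See my comment to Laurence. In my case, at least, the problem was easily solved by dumping the gene IDs into an indexed table and using that table; query time went from more than 30 minutes (at which point I gave up) to 7 seconds.
The required query is more complex. I am glad you solved the problem, but could I ask you to try this new solution? Given what I had to add to the query I am not too hopeful about performance, but I am still curious.
target needs to be linked through its own IndelSNPEffectConnector, because I do not want to limit the results to matching indel IDs.
The solution is: never use IN; use a correlated subquery with EXISTS (SELECT * FROM x ...) instead, or the equivalent JOIN x ... WHERE x.y IS NULL.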
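For the question's query, the JOIN ... IS NULL form mentioned in the last comment could look something like the sketch below. Assumptions: SQLite in memory as a stand-in for MySQL, the test data from this thread, and a derived-table alias `hit` that I made up. The join has to be a LEFT JOIN so that genes with no match survive with NULLs and can be picked out by the IS NULL test.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE InDels2 (id INTEGER);
CREATE TABLE InDels3 (id INTEGER);
CREATE TABLE IndelSNPEffectConnector (indelID INTEGER, snpEffectID INTEGER);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INTEGER, snpEffectID INTEGER);
INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88), (3, 77);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88), (42, 55), (42, 77);
""")

# Anti-join: keep InDels2 genes for which the LEFT JOIN to the set of
# InDels3 genes (alias "hit", hypothetical) found no partner.
anti_join_sql = """
SELECT DISTINCT g1.geneSymbolID
FROM SNPEffectGeneConnector AS g1
JOIN IndelSNPEffectConnector AS s1 ON g1.snpEffectID = s1.snpEffectID
JOIN InDels2 AS i2 ON i2.id = s1.indelID
LEFT JOIN (
    SELECT DISTINCT g2.geneSymbolID
    FROM SNPEffectGeneConnector AS g2
    JOIN IndelSNPEffectConnector AS s2 ON g2.snpEffectID = s2.snpEffectID
    JOIN InDels3 AS i3 ON i3.id = s2.indelID
) AS hit ON hit.geneSymbolID = g1.geneSymbolID
WHERE hit.geneSymbolID IS NULL
"""

anti_join_genes = sorted(row[0] for row in conn.execute(anti_join_sql))
print(anti_join_genes)  # prints [42]
```

Note that the derived table here is itself an unindexed intermediate result, which is the situation the accepted explanation warns about, so on MySQL this form is worth checking with EXPLAIN before relying on it.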
Test data for the test above (the last three INSERT statements add gene 42):
CREATE TABLE InDels2 ( id integer );
INSERT INTO InDels2 VALUES ( 1 );
CREATE TABLE InDels3 ( id integer );
INSERT INTO InDels3 VALUES ( 2 );
CREATE TABLE IndelSNPEffectConnector ( indelId integer, snpEffectID integer );
INSERT INTO IndelSNPEffectConnector VALUES ( 1, 55 ), ( 2, 88 );
CREATE TABLE SNPEffectGeneConnector ( geneSymbolID integer, snpEffectID integer );
INSERT INTO SNPEffectGeneConnector VALUES ( 100, 55 ), ( 100, 88 );
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 55 );
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 77 );
INSERT INTO IndelSNPEffectConnector VALUES ( 3, 77 );