MySQL: problem with a slow correlated subquery

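I have a query running in MySQL. As you can see, every part of the query is on an indexed field. Despite that, the query takes a very long time (tens of minutes, longer than I am willing to wait). The connector tables consist of two integers and two indexes (one on field one, field two; the other on field two, field one). Source and target are tables with a single indexed integer field. Given all the indexes, I would expect this query to finish within a few seconds. Any suggestions on 1: why it takes so long, and 2: how to make it faster?

Thanks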

mysql> explain 
SELECT DISTINCT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect 
  JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
  JOIN InDels2 AS source ON source.id = snpEConnect.indelID 
  WHERE geneConnect.geneSymbolID NOT IN (
    SELECT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect 
    JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
    JOIN InDels3 AS target ON target.id = snpEConnect.indelID);
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| id | select_type        | table       | type  | possible_keys     | key      | key_len | ref                                                                   | rows | Extra                          |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
|  1 | PRIMARY            | source      | index | id                | id       | 4       | NULL                                                                  | 5771 | Using index; Using temporary   |
|  1 | PRIMARY            | snpEConnect | ref   | snpEList          | snpEList | 4       | treattablebrowser.source.id                                           |    2 | Using index                    |
|  1 | PRIMARY            | geneConnect | ref   | snpEList          | snpEList | 4       | treattablebrowser.snpEConnect.snpEffectID                             |    1 | Using where; Using index       |
|  2 | DEPENDENT SUBQUERY | geneConnect | ref   | snpEList,geneList | geneList | 4       | func                                                                  |    1 | Using index                    |
|  2 | DEPENDENT SUBQUERY | target      | index | id                | id       | 4       | NULL                                                                  | 6297 | Using index; Using join buffer |
|  2 | DEPENDENT SUBQUERY | snpEConnect | ref   | snpEList          | snpEList | 8       | treattablebrowser.target.id,treattablebrowser.geneConnect.snpEffectID |    1 | Using index                    |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+

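6 rows in set (0.01 sec)

I imagine this is largely of academic interest now that Greg has solved the problem himself. It is good to know that my intuition about these things can break down completely. I can still offer three ways to rewrite it. I thought the first could be simplified, but as Greg pointed out, the simplification does not work. It does produce a different query plan in my SQL Server test, though I am not sure whether it would be any faster than the original.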

Select Distinct
    g1.geneSymbolID 
From
    SNPEffectGeneConnector AS g1 
        Inner Join
    IndelSNPEffectConnector AS s1 
        ON g1.snpEffectID = s1.snpEffectID 
        Inner Join
    InDels2 AS i2 ON i2.id = s1.indelID 
Where Not Exists (
    Select 'x'
        From
            SNPEffectGeneConnector As g2
                Inner Join
            IndelSNPEffectConnector AS s2 
                On g2.snpEffectID = s2.snpEffectID 
                Inner Join
            InDels3 As i3
                On i3.id = s2.indelID
        Where
            g2.geneSymbolID = g1.geneSymbolID
    );
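
I'm not 100% sure about the second approach, but it works on my small amount of test data. If it does work, it has a shorter query plan (not necessarily faster, but a good sign):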
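A grouped rewrite along these lines might look like the sketch below. It is an assumption, not necessarily the query the answer had in mind: it keeps the table and column names from the question, left-joins both InDels2 and InDels3 on the same connector row, and relies on COUNT ignoring NULLs to keep genes with at least one InDels2 match and no InDels3 match. The aliases g, s, i2 and i3 are illustrative.

```sql
-- Sketch only: one pass over the connector tables, grouped per gene.
SELECT g.geneSymbolID
FROM SNPEffectGeneConnector AS g
JOIN IndelSNPEffectConnector AS s
    ON g.snpEffectID = s.snpEffectID
LEFT JOIN InDels2 AS i2 ON i2.id = s.indelID  -- NULL when the indel is not a source
LEFT JOIN InDels3 AS i3 ON i3.id = s.indelID  -- NULL when the indel is not a target
GROUP BY g.geneSymbolID
HAVING COUNT(i2.id) > 0   -- at least one indel of this gene is in InDels2
   AND COUNT(i3.id) = 0;  -- none of its indels is in InDels3
```

Whether this beats the NOT EXISTS form depends on the optimizer, so comparing the EXPLAIN output of both on real data is the safer way to decide.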

Another way (apologies for the non-descriptive aliases):


The above query should be equivalent to yours and give better performance.

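If I understand correctly, you want to find all geneSymbolID values in SNPEffectGeneConnector that have an entry in IndelSNPEffectConnector whose indelID is in InDels2, but that have no corresponding entry whose indelID is in InDels3.

You can then run the first part of the query (the "do" part) and join the last part onto it, which collects all the genes that match both. A LEFT JOIN from the gene symbol table that requires the match to fail then yields all the genes that do not meet the inverse criterion, and these are the ones of interest.

Revised answer

This is the query that meets those conditions.

Now, for this query, I think you need the following indexes: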

CREATE INDEX SNPEffectGeneConnector_ndx
    ON SNPEffectGeneConnector(snpEffectID, geneSymbolID);

CREATE INDEX SNPEffectGeneConnector_ndx2
    ON SNPEffectGeneConnector(geneSymbolID);

CREATE INDEX IndelSNPEffectConnector_ndx
    ON IndelSNPEffectConnector(snpEffectID, indelID);

CREATE INDEX InDels2_ndx ON InDels2(id); -- optionally UNIQUE; not needed if id is the primary key
CREATE INDEX InDels3_ndx ON InDels3(id); -- optionally UNIQUE; not needed if id is the primary key
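
Since the question says the connector tables are already indexed in both column orders, it may be worth checking what exists before creating anything. A minimal check, assuming the table names from the question and using MySQL's SHOW INDEX statement:

```sql
-- List the existing indexes (name, column order, uniqueness) for each table.
SHOW INDEX FROM SNPEffectGeneConnector;
SHOW INDEX FROM IndelSNPEffectConnector;
SHOW INDEX FROM InDels2;
SHOW INDEX FROM InDels3;
```

A new index whose columns are a leftmost prefix of an existing index, in the same order, is unlikely to add anything.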

To get the genes of interest:

SELECT glob.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS glob
LEFT JOIN (
    SELECT DISTINCT genes.geneSymbolID
    FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
    JOIN SNPEffectGeneConnector AS effectSource
        ON ( genes.geneSymbolID = effectSource.geneSymbolID )
    JOIN SNPEffectGeneConnector AS effectTarget
        ON ( genes.geneSymbolID = effectTarget.geneSymbolID )
    JOIN IndelSNPEffectConnector AS indelSource
        ON ( indelSource.snpEffectID = effectSource.snpEffectID )
    JOIN IndelSNPEffectConnector AS indelTarget
        ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
    JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
    JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
) AS fits ON ( glob.geneSymbolID = fits.geneSymbolID )
WHERE fits.geneSymbolID IS NULL;
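
Test (the sample data is listed at the end)

Gene 100 is connected to effect 55, and 55 is connected to indel 1, so it is picked up through InDels2. But it is also connected to effect 88, and 88 is connected to indel 2, which is in InDels3, so it must not appear.

What should appear? If I have understood the requirements, we need a gene that produces an effect whose indel is not listed in InDels3. So, say, gene 42, which causes effect 77, which is tied to indel 3, which is not present in InDels3, must appear.

So, with rows for gene 42 added to the sample data, the query yields: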

+--------------+
| geneSymbolID |
+--------------+
|           42 |
+--------------+

A modification of the first query can be used to check why 42 gets through while 100 does not:

SELECT genes.geneSymbolID, effectSource.snpEffectID, effectTarget.snpEffectID, indelSource.indelId AS sourceInDel, indelTarget.indelId AS targetInDel, InDels3.id
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
 JOIN SNPEffectGeneConnector AS effectSource
     ON ( genes.geneSymbolID = effectSource.geneSymbolID)
 JOIN SNPEffectGeneConnector AS effectTarget
     ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
 JOIN IndelSNPEffectConnector AS indelSource
     ON ( indelSource.snpEffectID = effectSource.snpEffectID )
 JOIN IndelSNPEffectConnector AS indelTarget
     ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )

      JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
 LEFT JOIN InDels3 ON ( indelTarget.indelId = InDels3.id );

+--------------+-------------+-------------+-------------+-------------+------+
| geneSymbolID | snpEffectID | snpEffectID | sourceInDel | targetInDel | id   |
+--------------+-------------+-------------+-------------+-------------+------+
|           42 |          55 |          55 |           1 |           1 | NULL |
|           42 |          55 |          77 |           1 |           3 | NULL |
|          100 |          55 |          55 |           1 |           1 | NULL |
|          100 |          55 |          88 |           1 |           2 |    2 |
+--------------+-------------+-------------+-------------+-------------+------+

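…100 has a row whose InDels3 id is not NULL, and that row reports target indel 2.

It turned out that the problem was that, although everything is indexed, the gene IDs returned by the subquery are not. Joining against, or doing an IN search over, a non-indexed set of numbers performs very poorly, and that is what I was getting.

My solution was to run the outer and inner joins separately, dump the results into two different indexed tables, and then delete from table 1 the gene IDs that are also in table 2.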
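
The steps just described could be sketched roughly as follows. This is a reconstruction under assumptions, not the exact statements that were run: sourceGenes and targetGenes are hypothetical table names, temporary tables are one possible choice for the two indexed tables, and the joins are copied from the query in the question.

```sql
-- Sketch only: two small indexed tables holding gene IDs.
CREATE TEMPORARY TABLE sourceGenes (
    geneSymbolID INT,
    INDEX (geneSymbolID)
);
CREATE TEMPORARY TABLE targetGenes (
    geneSymbolID INT,
    INDEX (geneSymbolID)
);

-- Genes reachable from InDels2 (the outer part of the original query).
INSERT INTO sourceGenes
SELECT DISTINCT geneConnect.geneSymbolID
FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect
    ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID;

-- Genes reachable from InDels3 (the subquery of the original query).
INSERT INTO targetGenes
SELECT DISTINCT geneConnect.geneSymbolID
FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect
    ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels3 AS target ON target.id = snpEConnect.indelID;

-- Remove from the first table every gene that also appears in the second.
DELETE s
FROM sourceGenes AS s
JOIN targetGenes AS t ON t.geneSymbolID = s.geneSymbolID;

-- What remains is the answer to the original query.
SELECT geneSymbolID FROM sourceGenes;
```

Both lookups in the DELETE run against an indexed column, which is the point of materializing the two sets first.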


The moral of the story: never JOIN or IN against any set that is not indexed.

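Try using another JOIN instead of the IN() clause. IN is very slow.

Actually it is not that IN is inherently slow; it is that it runs against a pool of items with no index. IN against an indexed table is very fast.

Can't do that, because I can't match on indel ID; I have to match on gene ID.

Please provide some test data that produces different results between the queries.

I expect the following data would let the value in InDels2 through when it should not. InDels2 (source): 1; InDels3 (target): 2; IndelSNPEffectConnector (snpEConnect): 1:55; 2:88; SNPEffectGeneConnector: 55:100; 88:100;

@GregDougherty Thanks for that. I have updated the answer with a different query and added a link to it running against your test data. It was a good lesson for me. However, if you change LEFT OUTER JOIN InDels2 to INNER JOIN InDels2, the second suggestion does not need to check whether count(i2.id) > 0.

The indexes already exist. What I am looking for is different indel IDs that share the same gene ID, so this does not work.

Could you provide, say, three rows from each of the four tables (just the IDs) and an example of the desired output for those rows? I am fairly confident we can work something out.

See my comment to Laurence. At least in my case, the problem was easily solved by dumping the gene IDs into an indexed table and using that table: query time went from 30+ minutes (at which point I gave up) to 7 seconds.

The query that was needed is more complicated. I'm glad you solved it, but could I ask you to try this new solution? Given what I had to add to the query I am not too hopeful about performance, but I am still curious.

target needs to be linked through its own IndelSNPEffectConnector, because I do not want to restrict the results to matching indel IDs.

The solution is: never use IN; instead use a correlated subquery with EXISTS (SELECT * FROM x), or the equivalent JOIN x ... WHERE x.y IS NULL.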

Sample data used for the test above:

CREATE TABLE InDels2 ( id integer );
INSERT INTO InDels2 VALUES ( 1 );
CREATE TABLE InDels3 ( id integer );
INSERT INTO InDels3 VALUES ( 2 );
CREATE TABLE IndelSNPEffectConnector ( indelId integer, snpEffectID integer );
INSERT INTO IndelSNPEffectConnector VALUES ( 1, 55 ), ( 2, 88 );
CREATE TABLE SNPEffectGeneConnector ( geneSymbolID integer, snpEffectID integer );
INSERT INTO SNPEffectGeneConnector VALUES ( 100, 55 ), ( 100, 88 );

Additional rows for gene 42:

INSERT INTO SNPEffectGeneConnector VALUES ( 42, 55 );
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 77 );
INSERT INTO IndelSNPEffectConnector VALUES ( 3, 77 );