MySQL REGEXP not working as expected


mysql regex


My problem is that REGEXP doesn't seem to give the same result when the same pattern is used in different contexts. For example,

str REGEXP 'a|b|c|d|e'
//not equal to
str REGEXP 'a' OR str REGEXP 'b' OR ...

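I run the following query against a large data set in which `content` is text in the form of source code files. I use the inner REGEXP to find any source file that contains one of the keywords in my list. Once I have the full list, I go over it again and check which specific keyword triggered the match. This is where the discrepancy shows up: some files match when checked against all keywords at once, but not when checked against each keyword individually.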

select
    source_histories.id,
    MAX(source_histories.master_event_id) as master_event_id,
    source_histories.source_file_id,
    source_histories.content REGEXP '[;{}[:space:]]break;' as break,
    source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
    source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
    source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
    source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
    source_histories,
    (SELECT 
        DISTINCT source_file_id
    from
        source_histories
    where
        ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;'
    LIMIT 100
    ) as sourceIdList
where
    source_histories.source_file_id = sourceIdList.source_file_id 
group by
    source_histories.source_file_id;

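Below is the part of the result that shows the problem. As you can see, source file ids 92 and 95 match none of the keywords when checked individually, yet they must have matched when checked against all keywords at once. I looked through their source code, and they do contain one or more of the keywords.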

id  master_event_id  source_file_id  break break_label  continue  throw  return_void 
256 3260             63              1     0            0         1      0
258 3640             65              1     0            0         0      0
259 3640             66              0     0            0         1      0
320 93722            85              1     0            0         0      0
346 471              92              0     0            0         0      0
360 93731            95              0     0            0         0      0
483 96052            108             1     0            0         0      0
536 1010             112             0     0            0         1      0

Does anyone have a suggestion about my problem? Is this a slight oversight on my part, or a nuance of MySQL?

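SOLUTION: The problem was the order in which I was analyzing the data. I was looking for the unique source file ids that matched my criteria, but there was no guarantee that the corresponding latest version of the file (the one with the max master_event_id) also contained the keywords. Here is the solution I came up with (however slow it may be).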

select
    source_histories.id,
    source_histories.master_event_id as master_event_id,
    source_histories.source_file_id,
    source_histories.content REGEXP '[;{}[:space:]]break;' as break,
    source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
    source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
    source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
    source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
    source_histories
    inner join
        (select
            source_histories.id,
            MAX(source_histories.master_event_id) as master_event_id,
            source_histories.source_file_id
        from
            source_histories
            inner join
                (SELECT
                    DISTINCT source_file_id
                FROM
                    source_histories
                LIMIT 100
                ) as distinctSHList
            on
                source_histories.source_file_id = distinctSHList.source_file_id
            group by
                source_file_id
        ) as lastestSourceList
    on source_histories.id = lastestSourceList.id
where
    ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;';
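
One caveat about the query above: the inner subquery selects source_histories.id next to MAX(master_event_id) while grouping only by source_file_id, so MySQL is not required to return the id of the row that actually holds the maximum (and, as far as I know, servers running with ONLY_FULL_GROUP_BY reject that form). A minimal sketch of a variant that joins on the (source_file_id, master_event_id) pair instead is shown here. It assumes that pair identifies exactly one row, which the question does not state, and it drops the LIMIT 100 sampling for brevity.

```sql
-- Sketch only: assumes (source_file_id, master_event_id) is unique per row.
SELECT
    sh.id,
    sh.master_event_id,
    sh.source_file_id,
    sh.content REGEXP '[;{}[:space:]]break;'    AS break,
    sh.content REGEXP '[;{}[:space:]]break '    AS break_label,
    sh.content REGEXP '[;{}[:space:]]continue;' AS `continue`,
    sh.content REGEXP '[;{}[:space:]]throw '    AS throw,
    sh.content REGEXP '[;{}[:space:]]return;'   AS return_void
FROM
    source_histories AS sh
    INNER JOIN
        -- Latest event per file; only grouped and aggregated columns are selected.
        (SELECT
            source_file_id,
            MAX(master_event_id) AS master_event_id
        FROM
            source_histories
        GROUP BY
            source_file_id
        ) AS latest
    ON  sh.source_file_id  = latest.source_file_id
    AND sh.master_event_id = latest.master_event_id
WHERE
    -- Same combined pattern as in the question, applied to the latest row only.
    IFNULL(sh.content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;';
```

Because the filter and the per-keyword columns now look at the same row, any row that passes the WHERE clause should show at least one 1 among the keyword columns.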

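The problem is not in the REGEXP clause but in the way you do the selection.

The subquery considers all source_histories rows for a given source_file_id, whereas the main query (because of the GROUP BY) considers only one source_histories row per source_file_id.

To verify, remove the GROUP BY and MAX clauses from the query and join on source_histories.id; the results should then be consistent:

select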
    source_histories.id,
    source_histories.source_file_id,
    source_histories.content REGEXP '[;{}[:space:]]break;' as break,
    source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
    source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
    source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
    source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
    source_histories,
    (SELECT 
        id
    from
        source_histories
    where
        ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;'
    LIMIT 100
    ) as sourceHistoriesIdList
where
    source_histories.id = sourceHistoriesIdList.id
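
To make the effect concrete, here is a small reproduction. The table name sh_demo and its two rows are made up for illustration; only the column names and the pattern come from the question.

```sql
-- Hypothetical toy table; column names mirror the question, data is invented.
CREATE TABLE sh_demo (
    id              INT PRIMARY KEY,
    master_event_id INT NOT NULL,
    source_file_id  INT NOT NULL,
    content         TEXT
);

INSERT INTO sh_demo (id, master_event_id, source_file_id, content) VALUES
    (1, 10, 92, 'while (x) { break; }'),  -- older version: contains " break;"
    (2, 20, 92, 'while (x) { x--; }');    -- latest version: keyword is gone

-- File-level filter: file 92 is selected because SOME version matches.
SELECT DISTINCT source_file_id
FROM sh_demo
WHERE content REGEXP '[;{}[:space:]]break;';
-- returns: 92

-- Row-level check: only the older version matches.
SELECT id, master_event_id,
       content REGEXP '[;{}[:space:]]break;' AS break
FROM sh_demo
ORDER BY id;
-- returns: (1, 10, 1), (2, 20, 0)
```

If the outer query ends up reporting the row with id 2 for file 92, every keyword column is 0 even though the file passed the filter, which is the pattern seen for ids 92 and 95 in the question.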

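If I understand correctly, you mean: a) the subquery filters through all of the source_histories content for each source_file_id, and b) the main query analyzes one source_histories entry at a time, based on the corresponding source_file_id (found in the subquery). That is the intended effect of the query. As I understand it, the subquery returns a subset of the entries, and the main query does a more detailed analysis, again creating a subset of the subset (so to speak). My question is: if the content corresponding to each source_file_id is constant, what difference would it make (set or subset)?

The subquery returns a list of source_file_id values for which at least one record matches your regular expression. In the main query you pick one record for each selected source_file_id, but there is no guarantee that the record you pick in the main query was also selected by the subquery. All you know is: 1) they have the same source_file_id, and 2) for that source_file_id, at least one record matches.

Following your suggestion, I removed the GROUP BY and MAX clauses. The results still don't match, but it does reveal some aspects of the data I'm working with. I'll work through this and get back to you.

Yes, you're right. They won't match. You need to join on source_histories.id to make them match. I'll update my answer.

After playing around a bit, I found that I need to get the max master_event_id first, before applying the regular expressions. Thanks for your help! I'll post the solution I came up with.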
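
For anyone who wants to check both points from this thread against their own data, the two queries below are a rough sketch. They assume the source_histories table from the question; the aliases mismatches and *_any_version are my own names.

```sql
-- 1) Row-level sanity check: the combined pattern and the OR of the
--    individual patterns should agree on every row, so this count is
--    expected to be 0 if REGEXP itself is not the culprit.
SELECT COUNT(*) AS mismatches
FROM source_histories
WHERE (IFNULL(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;')
   <> (   IFNULL(content, '') REGEXP '[;{}[:space:]]break;'
       OR IFNULL(content, '') REGEXP '[;{}[:space:]]break '
       OR IFNULL(content, '') REGEXP '[;{}[:space:]]continue;'
       OR IFNULL(content, '') REGEXP '[;{}[:space:]]throw '
       OR IFNULL(content, '') REGEXP '[;{}[:space:]]return;');

-- 2) File-level view: REGEXP yields 1 or 0, so MAX() over a file's rows
--    tells whether ANY version of that file contains the keyword.
SELECT
    source_file_id,
    MAX(IFNULL(content, '') REGEXP '[;{}[:space:]]break;')    AS break_any_version,
    MAX(IFNULL(content, '') REGEXP '[;{}[:space:]]break ')    AS break_label_any_version,
    MAX(IFNULL(content, '') REGEXP '[;{}[:space:]]continue;') AS continue_any_version,
    MAX(IFNULL(content, '') REGEXP '[;{}[:space:]]throw ')    AS throw_any_version,
    MAX(IFNULL(content, '') REGEXP '[;{}[:space:]]return;')   AS return_void_any_version
FROM source_histories
GROUP BY source_file_id;
```

The second query answers "did any version of this file ever contain the keyword", which is what the original subquery was effectively testing; it is a different question from "does the latest version contain it".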