C# regex parsing email addresses: high CPU load


c# regex



I am currently using the following regular expression and code to parse email addresses out of HTML documents:

string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
      pattern,
      RegexOptions.None | RegexOptions.Compiled);

MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());

foreach (Match match in matches)
{
    ...
}

For example, trying to parse

http://www.amelia.se/Pages/Amelia-search-result-page/?q=

in RegexHero makes it crash.

Is there any way to optimize this?

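To elaborate on @Joey's suggestion: I would go through your input line by line, discard every line that does not contain an @ character, and apply your regex only to the lines that do. That reduces the load considerably.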

private List<Match> find_emails_matches()
{
    List<Match> result = new List<Match>();

    using (FileStream stream = new FileStream(@"C:\tmp\test.txt", FileMode.Open, FileAccess.Read))
    {
        using(StreamReader reader = new StreamReader(stream))
        {
            string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
            Regex regex = new Regex(pattern, RegexOptions.None | RegexOptions.Compiled);

            string line;
            while((line = reader.ReadLine()) != null)
            {
                // Cheap pre-filter: a line without '@' cannot contain an address.
                if (line.Contains('@'))
                {
                    MatchCollection matches = regex.Matches(line); // Here is where it takes time
                    foreach(Match m in matches) result.Add(m);
                }
            }
        }
    }

    return result;
}
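
A variation on the function above, offered only as a sketch: it reads from any TextReader, returns the matched strings, and passes a match timeout to the Regex constructor (an overload available since .NET Framework 4.5), so a pathological line is abandoned instead of pinning the CPU. The class name EmailExtractor, the method name FindEmails, and the one-second limit are assumptions made for illustration; none of them come from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class EmailExtractor   // hypothetical helper name
{
    // Same pattern as in the question, built once and reused.
    // The one-second timeout is an arbitrary illustrative value.
    private static readonly Regex EmailRegex = new Regex(
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(1));

    public static List<string> FindEmails(TextReader reader)
    {
        var result = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Cheap pre-filter: a line without '@' cannot contain a match.
            if (line.IndexOf('@') < 0) continue;

            try
            {
                // Matches() is evaluated lazily, so a timeout surfaces
                // while enumerating, not at the call itself.
                foreach (Match m in EmailRegex.Matches(line))
                    result.Add(m.Value);
            }
            catch (RegexMatchTimeoutException)
            {
                // Give up on this line and move on to the next one.
            }
        }
        return result;
    }
}
```

It could be called as EmailExtractor.FindEmails(File.OpenText(path)) for a file, or with a StringReader wrapped around HTML that is already in memory. Returning strings instead of Match objects is a design choice of this sketch; keep the Match objects if you need positions or groups.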


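1. Don't parse HTML with regular expressions; use a proper parser. 2. Don't use a regex to decide whether a string is an email address; use a library (doing it with a regex is more complex than it looks). I can think of only one reason to extract email addresses from arbitrary HTML documents, and I certainly don't support that reason.

Please read the linked article. It explains why the regex is slow and the CPU load is high.

@Elvin: Read the complete article ;-)

@CodeCaster: Catastrophic backtracking is a problem too, and it affects nearly every regex engine. But .NET has a solution that JavaScript lacks (I even outlined it in my answer to the JavaScript question).

Works great!! Thanks a bunch.

No need to elaborate on my answer; it is wrong for this particular case. I was coming at it from a validation standpoint, not extraction.

Really clever solution!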
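
The comment about .NET having a remedy for backtracking does not say which feature is meant. One candidate that .NET supports is the atomic group, written (?>...). As a hedged sketch, the part of the question's pattern before the @ can be wrapped in atomic groups so that the engine does not retry shorter prefixes once it finds that no @ follows. The domain part is left unchanged, because it relies on backtracking to hand the final .\w+ segment back. The names AtomicEmailDemo and AtomicPattern are made up for this example, and any speed-up should be measured on real input, not assumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicEmailDemo   // hypothetical name
{
    // (?>...) is an atomic group: once it has matched, the engine will not
    // backtrack into it. Only the part before '@' is made atomic here.
    public const string AtomicPattern =
        @"(?>\w+)(?>(?:[-+.]\w+)*)@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    public static void Main()
    {
        var regex = new Regex(AtomicPattern, RegexOptions.Compiled);
        string input = "Contact john.doe@example.com for details.";

        foreach (Match m in regex.Matches(input))
            Console.WriteLine(m.Value);   // prints john.doe@example.com
    }
}
```

Note that the local-part groups are non-capturing in this sketch, so the group numbers differ from those of the original pattern.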