C#: \d is less efficient than [0-9]

c# regex performance

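Yesterday I commented on an answer in which someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said that a range or the digit specifier was probably more efficient than a character set.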

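I decided to test that today and found, to my surprise, that (at least in the C# regex engine) \d appears to be less efficient than either of the other two, which do not seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters each, 5077 of which actually contain a digit: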

Regex \d           took 00:00:00.2141226 result: 5077/10000
Regex [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

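This surprises me for two reasons, and I would be interested if anyone can shed some light on them.

First, I would have expected the range to be implemented much more efficiently than the set.

Second, I don't understand why \d is worse than [0-9]. Is there more to \d than being shorthand for [0-9]?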

Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //in roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //replace 1 char with a digit 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}

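\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, digits from other scripts (such as the Eastern Arabic numerals) are Unicode digits that match \d but do not match [0-9].

You can generate a list of all such characters using the following code: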
var sb = new StringBuilder();
for(UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());

This produces:
0123456789०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙０１２３４５６７８９
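
To quote:

[0-9] is not equivalent to \d. [0-9] matches only the characters 0123456789, while \d matches [0-9] and other digit characters, for example Eastern Arabic numerals.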
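
The difference described above can be checked directly. The following is a minimal sketch, not code from any of the answers: the class name DigitClassDemo and its helper methods are made up for illustration, and the sample character is DEVANAGARI DIGIT ONE (U+0967), one of the characters in the generated list above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class DigitClassDemo
{
    // \d with default options: matches any Unicode decimal digit.
    public static bool MatchesShorthand(string s) => Regex.IsMatch(s, @"^\d$");

    // [0-9]: matches only the ten ASCII digits.
    public static bool MatchesAsciiRange(string s) => Regex.IsMatch(s, "^[0-9]$");

    public static void Main()
    {
        string ascii = "7";
        string devanagariOne = "\u0967"; // DEVANAGARI DIGIT ONE

        Console.WriteLine(MatchesShorthand(ascii));          // True
        Console.WriteLine(MatchesAsciiRange(ascii));         // True
        Console.WriteLine(MatchesShorthand(devanagariOne));  // True
        Console.WriteLine(MatchesAsciiRange(devanagariOne)); // False
    }
}
```

Because \d has to test membership in a whole Unicode category rather than a ten-character range, it is plausible that this is where the extra time in the benchmark goes, although the sketch itself only demonstrates the matching behaviour, not the cost.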

Credit to ByteBlast for noticing this in the documentation. Just change the regex constructor:
var rex = new Regex(regex, RegexOptions.ECMAScript);

This gives new timings:
Regex \d           took 00:00:00.1355787 result: 5077/10000
Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first
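
The timings converge because, under RegexOptions.ECMAScript, \d is treated as equivalent to [0-9]. A minimal sketch of that behaviour change follows; the class name EcmaScriptDigitDemo and its helpers are illustrative and not part of the original test code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class EcmaScriptDigitDemo
{
    // Default options: \d is Unicode-aware.
    public static bool IsDigitDefault(string s) => Regex.IsMatch(s, @"^\d$");

    // ECMAScript-compliant behaviour: \d is restricted to ASCII 0-9.
    public static bool IsDigitEcma(string s) =>
        Regex.IsMatch(s, @"^\d$", RegexOptions.ECMAScript);

    public static void Main()
    {
        string devanagariOne = "\u0967"; // DEVANAGARI DIGIT ONE

        Console.WriteLine(IsDigitDefault(devanagariOne)); // True
        Console.WriteLine(IsDigitEcma(devanagariOne));    // False
        Console.WriteLine(IsDigitEcma("5"));              // True
    }
}
```

Note that RegexOptions.ECMAScript changes other behaviour as well and can only be combined with a few other options, so when only ASCII digits are wanted, writing [0-9] explicitly may be the more transparent choice.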

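
Building on the earlier answer, here is a .NET 4.5 version of its code (only that version supports UTF-16 output; compare the first three lines) that uses the full range of Unicode code points.

Because support for the higher Unicode planes is often lacking, many people are not aware that they should always check for and include them. Nevertheless, those planes do sometimes contain important characters.

Update

Since \d does not support non-BMP characters in regex (thanks for pointing this out), here is a version that uses the Unicode character database.

Update 2

Thanks to a further suggestion, I have added the missing reference to the UCD (via the NuGet package UnicodeInformation). I have also upgraded to the latest .NET Core version and UTF-8 output.
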
// reference https://www.nuget.org/packages/UnicodeInformation/
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.Unicode;
                    
public class Program
{
    public static void Main()
    {
        var unicodeEncoding = new UTF8Encoding(false);
        Console.OutputEncoding = unicodeEncoding;

        var numberCategories = new HashSet<UnicodeCategory>(new []{
            UnicodeCategory.DecimalDigitNumber,
            UnicodeCategory.LetterNumber,
            UnicodeCategory.OtherNumber
        });
        var numberLikeChars =
            from codePoint in Enumerable.Range(0, 0x110000) // count 0x110000 covers U+0000 through U+10FFFF
            where codePoint > UInt16.MaxValue 
                || (!char.IsLowSurrogate((char) codePoint) && !char.IsHighSurrogate((char) codePoint))
            let charInfo = UnicodeInfo.GetCharInfo(codePoint)
            where numberCategories.Contains(charInfo.Category)
            let codePointString = char.ConvertFromUtf32(codePoint)
            select (codePoint, charInfo, codePointString);

        foreach (var (codePoint, charInfo, codePointString) in numberLikeChars)
        {
            Console.Write("U+{0} ", codePoint.ToString("X6"));
            Console.Write(" {0,-4}", codePointString);
            Console.Write(" {0,-40}", charInfo.Name ?? charInfo.OldName);
            Console.Write(" {0,-6}", CharUnicodeInfo.GetNumericValue(codePointString, 0));
            Console.Write(" {0,-6}", CharUnicodeInfo.GetDigitValue(codePointString, 0));
            Console.Write(" {0,-6}", CharUnicodeInfo.GetDecimalDigitValue(codePointString, 0));
            Console.WriteLine(" {0}", charInfo.Category);
        }
    }
}
