Get a substring containing the first N words in C#/.NET using a regular expression

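I want to get a string containing the first N words of a given string. For example:

Getting the first 5 words from "The quick_brown, fox1 jumps over the lazy dog" should return "The quick_brown, fox1 jumps over".

Note that a word consists of letters, digits and _ (basically whatever \w+ matches), and that all original separators between the words are preserved.

I did this in classic C# code, as follows: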

public static bool IsWordChar(this char c)
{
    return char.IsLetterOrDigit(c) || c == '_';
}

public static string GetFirstWords(string s, int wordCount, string truncateSuffix = " [...]")
{
    var sb = new StringBuilder();
    int currWordCount = 0;
    char prevC = '\0';
    foreach (var c in s)
    {
        sb.Append(c);
        if (!c.IsWordChar() && prevC.IsWordChar())
            currWordCount++;

        if (currWordCount >= wordCount)
        {
            if (sb.Length < s.Length)
                sb.Append(truncateSuffix);

            return sb.ToString();
        }

        prevC = c;
    }

    // Fewer than wordCount words: the loop has already appended every
    // character (including the last word), so nothing more is added here.

    return sb.ToString();
}
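It runs fast enough for my needs (O(n)), but I wonder whether the same can be achieved with a regular expression.

I tried matching \w+ and taking the first N matches, but that way I lose the actual non-word separators from the original text.

Question: is there a C# regex equivalent of the code above?

Thanks.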

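I would use word boundaries (\b) to find the words, rather than just \w.

If I rephrase your problem slightly and search for the first N words plus the N-1 "things between words", you could use

Regex.Match("The quick_brown, fox1 jumps over the lazy dog", @"^(\b.+?\b){9}")

to get the expected result for N=5 (5 words plus 4 separators gives 9 segments).

Note that this assumes the input starts with a word.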
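
To make the segment arithmetic above concrete, here is a minimal sketch that builds the same pattern for an arbitrary N. The class and method names are hypothetical, and the use of RegexOptions.Singleline (so that . also matches line breaks) is an assumption that is not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryExample
{
    // Hypothetical helper: N words plus the N-1 separators between them
    // make 2N-1 segments, each delimited by word boundaries.
    public static string FirstWordsByBoundary(string s, int wordCount)
    {
        if (wordCount < 1)
            throw new ArgumentOutOfRangeException(nameof(wordCount));

        int segments = 2 * wordCount - 1;
        string pattern = @"^(?:\b.+?\b){" + segments + "}";

        // Singleline is an added assumption so that separators may contain newlines.
        var match = Regex.Match(s, pattern, RegexOptions.Singleline);

        // Like the original pattern, this only works when the input starts with a word;
        // if there is no match, the input is returned unchanged.
        return match.Success ? match.Value : s;
    }
}
```

Calling FirstWordsByBoundary("The quick_brown, fox1 jumps over the lazy dog", 5) yields "The quick_brown, fox1 jumps over"; an input with fewer than N words does not match and comes back unchanged.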

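A regular expression that extracts the substring containing the first five words of a longer string is

@"^\W*\w+(?:\W+\w+){4}"

Details

  • ^
    - start of the string
  • \W*
    - zero or more non-word characters
  • \w+
    - one or more word characters
  • (?:\W+\w+){4}
    - four sequences (replace with
    {0,4}
    if the input may contain fewer than five words and the expected output is then the whole string) of:
    • \W+
      - one or more non-word characters
    • \w+
      - one or more word characters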
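
As an illustration of the breakdown above, here is a small sketch that applies the {0,4} variant of the pattern. The class and method names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstWordsPatternExample
{
    // Hypothetical helper: returns up to the first five words,
    // keeping the original separators between them.
    public static string FirstFiveWords(string s)
    {
        // {0,4} instead of {4}: inputs with fewer than five words still match.
        var match = Regex.Match(s, @"^\W*\w+(?:\W+\w+){0,4}");

        // No match means the input contains no word characters at all.
        return match.Success ? match.Value : s;
    }
}
```

With this sketch, "The quick_brown, fox1 jumps over the lazy dog" gives "The quick_brown, fox1 jumps over", while "  Hello, world!" gives "  Hello, world": the leading \W* keeps the leading spaces, and the trailing "!" is dropped because no word follows it.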
Whether the regex is actually more efficient is something you need to test in C#. To use a regex efficiently, declare it as a readonly field with RegexOptions.Compiled and then call regex.Match on it.


For those who want quick working code, here is the code I am using, based on the accepted answer:

public static string GetFirstWordsRegEx(string s, int wordCount, string truncateSuffix = " [...]")
{
    // replace with string.Format for C# less than 6.0
    string pattern = $@"^\W*\w+(?:\W+\w+){{{wordCount - 1}}}";
    var regex = new Regex(pattern);
    var match = regex.Match(s);
    if (!match.Success)
        return s;

    var ret = match.Value;
    return ret.Length < s.Length ? ret + truncateSuffix : ret;
}
Test code

var sw = new Stopwatch();
var rand = new Random();
const int minValue = 20;
const int maxValue = 150;

#region warm up
sw.Start();
Console.WriteLine($"Warming up...");

// _testStrings contains 100K random real texts which may be longer or not than first words truncation value
foreach (var str in _testStrings)
{
    var dummy = str;
}
Console.WriteLine($"Warm up took {sw.ElapsedMilliseconds} ms");
sw.Restart(); // reset so the warm-up time is not counted in the first measurement
#endregion

#region Classic C# approach
foreach (var str in _testStrings)
{
    int wordCount = rand.Next(minValue, maxValue);
    var firstWords = Utils.GetFirstWords(str, wordCount);
}
Console.WriteLine($"Classic code took {sw.ElapsedMilliseconds} ms");
sw.Restart();
#endregion

#region Uncompiled regex
sw.Start();
foreach (var str in _testStrings)
{
    int wordCount = rand.Next(minValue, maxValue);
    var firstWords = Utils.GetFirstWordsRegEx(str, wordCount);
}
Console.WriteLine($"Uncompiled regex code took {sw.ElapsedMilliseconds} ms");
sw.Restart();
#endregion

#region Compiled regex
sw.Start();
foreach (var str in _testStrings)
{

    int wordCount = rand.Next(minValue, maxValue);
    var firstWords = Utils.GetFirstWordsRegExOptimized(str, wordCount);
}
Console.WriteLine($"Compiled regex code took {sw.ElapsedMilliseconds} ms");
sw.Restart();
#endregion
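Results

Classic code took 953 ms

Uncompiled regex code took 5559 ms

Compiled regex code took 4194 ms


As expected, the compiled regex is faster than the uncompiled one. However, the classic version is much faster.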

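I think you can try Regex.Replace(s, @"(?s)^(\W*\w+(?:\W+\w+){4}).*", "$1 [...]"). If the string may contain fewer than 5 words and the result should then contain the whole input, replace {4} with {0,4}.

Yes, it works fine with Match (no Replace needed). Please post it as an answer so I can accept it. Thanks.

A regex will be slower than your method (at least when the strings are large enough for it to matter), so do not switch to it for performance reasons.

Note that you probably do not want to use RegexOptions.Compiled for a regex that is created on every call. Wiktor's answer explicitly moves it into a (static) field so that it is created and compiled once. Given that your expression changes with the input (the word count), I would probably keep it your way (i.e. inside the method) but drop that option entirely. See for yourself and measure/decide.

Yes, that is correct. I removed it. Thank you very much.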
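
The Regex.Replace suggestion from the first comment can be sketched as follows. The class and method names are hypothetical, and the pattern is the one quoted in that comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceTruncationExample
{
    // Hypothetical helper: keeps the first five words (capture group 1)
    // and replaces everything after them with the suffix.
    public static string TruncateToFiveWords(string s)
    {
        // (?s) lets . match newlines, so .* consumes the rest of the input.
        return Regex.Replace(s, @"(?s)^(\W*\w+(?:\W+\w+){4}).*", "$1 [...]");
    }
}
```

One thing to be aware of with this form: because .* also matches the empty string, the suffix is appended even when the input consists of exactly five words and nothing is cut off. An input with fewer than five words does not match and is returned unchanged.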
    // generated all possible text truncation patterns
    private static readonly List<Regex> FirstWordRegexes = new List<Regex>
    {
        new Regex(@"^\W*\w+(?:\W+\w+){0}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){1}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){2}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){3}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){4}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){5}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){6}", RegexOptions.Compiled),
        // ...
        // removed for brevity
        // ...
        new Regex(@"^\W*\w+(?:\W+\w+){147}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){148}", RegexOptions.Compiled),
        new Regex(@"^\W*\w+(?:\W+\w+){149}", RegexOptions.Compiled),
    };

    public static string GetFirstWordsRegExOptimized(string s, int wordCount, string truncateSuffix = " [...]")
    {
        var regex = FirstWordRegexes[wordCount-1];
        var match = regex.Match(s);
        if (!match.Success)
            return s;

        var ret = match.Value;
        return ret.Length < s.Length ? ret + truncateSuffix : ret;
    }
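
As an alternative to writing out every pattern by hand, the compiled regexes could be created lazily and cached per word count. This is only a sketch of that idea, not the code that was benchmarked here; the class name, the method name and the choice of ConcurrentDictionary are assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class FirstWordsCache
{
    // One compiled regex per distinct word count, built on first use.
    private static readonly ConcurrentDictionary<int, Regex> Cache =
        new ConcurrentDictionary<int, Regex>();

    // Hypothetical variant of GetFirstWordsRegExOptimized with a lazy cache.
    public static string GetFirstWordsCached(string s, int wordCount, string truncateSuffix = " [...]")
    {
        if (wordCount < 1)
            throw new ArgumentOutOfRangeException(nameof(wordCount));

        var regex = Cache.GetOrAdd(wordCount, n =>
            new Regex(@"^\W*\w+(?:\W+\w+){" + (n - 1) + "}", RegexOptions.Compiled));

        var match = regex.Match(s);
        if (!match.Success)
            return s; // fewer than wordCount words: return the input unchanged

        var ret = match.Value;
        return ret.Length < s.Length ? ret + truncateSuffix : ret;
    }
}
```

This keeps the compile-once behaviour of the static list while supporting any positive word count. Whether it is faster or slower than the pre-built list is something that would have to be measured.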