Get a substring containing the first N words using regex in C#.NET
I want to get a string containing the first N words of a given string. For example, getting the first 5 words of "The quick_brown, fox1 jumps over the lazy dog" should return "The quick_brown, fox1 jumps over". Note that a word consists of letters, digits and _ (basically what \w+ matches), and that all original separators are preserved (e.g. ", ").
I did this with classic C# code, as follows:
using System.Text;

public static class Utils
{
    // A word character is a letter, a digit or an underscore (roughly what \w matches)
    public static bool IsWordChar(this char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    public static string GetFirstWords(string s, int wordCount, string truncateSuffix = " [...]")
    {
        var sb = new StringBuilder();
        int currWordCount = 0;
        char prevC = '\0';
        foreach (var c in s)
        {
            // a word ends where a non-word character follows a word character
            if (!c.IsWordChar() && prevC.IsWordChar())
                currWordCount++;
            if (currWordCount >= wordCount)
            {
                // enough words collected: stop before this separator and mark the truncation
                sb.Append(truncateSuffix);
                return sb.ToString();
            }
            sb.Append(c);
            prevC = c;
        }
        // the input had no more than wordCount words, so sb already holds all of s
        return sb.ToString();
    }
}
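It works fast enough for my needs (O(n)), but I wonder whether the same can be achieved with a regex.
I tried using \w+ and taking the first N matches, but that way I lose the actual non-word separators of the original text.
Question: is there a C# regex equivalent of the code above?
Thanks.

I would use word boundaries (\b) to find the words, rather than just \w and \W.
If I rephrase your problem slightly as searching for the first N words and the N-1 "things between words", you might be able to use
Regex.Match("The quick_brown, fox1 jumps over the lazy dog", @"^(\b.+?\b){9}")
to get the expected result for N=5 (9 = 2N-1 boundary-delimited chunks: 5 words plus 4 separators).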
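As a rough illustration, the word-boundary idea could be generalized to any N by repeating the chunk 2N-1 times. This is only a sketch: the class name WordBoundaryDemo and the helper FirstWordsByBoundary are hypothetical names, not part of the original answer, and the helper assumes N >= 1 and an input that starts with a word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: N words plus N-1 separators = 2N-1 chunks,
    // each chunk delimited by word boundaries.
    public static string FirstWordsByBoundary(string s, int n)
    {
        string pattern = @"^(?:\b.+?\b){" + (2 * n - 1) + "}";
        Match m = Regex.Match(s, pattern);
        // fall back to the whole input when there are not enough chunks
        return m.Success ? m.Value : s;
    }

    public static void Main()
    {
        Console.WriteLine(FirstWordsByBoundary("The quick_brown, fox1 jumps over the lazy dog", 5));
        // prints: The quick_brown, fox1 jumps over
    }
}
```

Because . does not match a newline by default, this sketch only works as intended on single-line input.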
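Note that the \b pattern assumes the input starts with a word.

The regex to extract a substring containing the first five words from a longer string is
@"^\W*\w+(?:\W+\w+){4}"
Details:
- ^ - start of the string
- \W* - zero or more non-word characters
- \w+ - one or more word characters
- (?:\W+\w+){4} - 4 sequences (replace {4} with {0,4} if the input may contain fewer than 5 words and the expected output is then the whole string) of:
  - \W+ - one or more non-word characters
  - \w+ - one or more word characters
Use a Regex instance created with RegexOptions.Compiled and declared as a static readonly field, then call regex.Match on it.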
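The advice above might look like the following sketch. It uses the {0,4} variant so that shorter inputs come back whole. The class name FirstFiveWords, the field name FirstFive and the method name Get are illustrative assumptions, not names from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstFiveWords
{
    // Created and compiled once, then reused for every call.
    private static readonly Regex FirstFive =
        new Regex(@"^\W*\w+(?:\W+\w+){0,4}", RegexOptions.Compiled);

    // Hypothetical helper: returns up to the first five words, separators preserved.
    public static string Get(string s)
    {
        Match m = FirstFive.Match(s);
        // no match means the input contains no word at all
        return m.Success ? m.Value : s;
    }

    public static void Main()
    {
        Console.WriteLine(Get("The quick_brown, fox1 jumps over the lazy dog"));
        // prints: The quick_brown, fox1 jumps over
        Console.WriteLine(Get("Only three words"));
        // prints: Only three words
    }
}
```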

For those who want quick working code, here is what I am using, based on the accepted answer:
public static string GetFirstWordsRegEx(string s, int wordCount, string truncateSuffix = " [...]")
{
    // replace with string.Format for C# less than 6.0
    string pattern = $@"^\W*\w+(?:\W+\w+){{{wordCount - 1}}}";
    var regex = new Regex(pattern);
    var match = regex.Match(s);
    // fewer than wordCount words: return the input unchanged
    if (!match.Success)
        return s;
    var ret = match.Value;
    return ret.Length < s.Length ? ret + truncateSuffix : ret;
}
Test code
var sw = new Stopwatch();
var rand = new Random();
const int minValue = 20;
const int maxValue = 150;

#region warm up
sw.Start();
Console.WriteLine("Warming up...");
// _testStrings contains 100K random real texts, which may or may not be longer than the truncation length
foreach (var str in _testStrings)
{
    var dummy = str;
}
Console.WriteLine($"Warm up took {sw.ElapsedMilliseconds} ms");
#endregion

#region Classic C# approach
sw.Restart(); // keep the warm-up out of the measurement
foreach (var str in _testStrings)
{
    int wordCount = rand.Next(minValue, maxValue);
    var firstWords = Utils.GetFirstWords(str, wordCount);
}
Console.WriteLine($"Classic code took {sw.ElapsedMilliseconds} ms");
#endregion

#region Uncompiled regex
sw.Restart();
foreach (var str in _testStrings)
{
    int wordCount = rand.Next(minValue, maxValue);
    var firstWords = Utils.GetFirstWordsRegEx(str, wordCount);
}
Console.WriteLine($"Uncompiled regex code took {sw.ElapsedMilliseconds} ms");
#endregion

#region Compiled regex
sw.Restart();
foreach (var str in _testStrings)
{
    int wordCount = rand.Next(minValue, maxValue);
    var firstWords = Utils.GetFirstWordsRegExOptimized(str, wordCount);
}
Console.WriteLine($"Compiled regex code took {sw.ElapsedMilliseconds} ms");
#endregion
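Results
Classic code took 953 ms
Uncompiled regex code took 5559 ms
Compiled regex code took 4194 ms
As expected, the compiled regex is faster than the uncompiled one. However, the classic version is much faster than both.

Comments:
I think you can try Regex.Replace(s, @"(?s)^(\W*\w+(?:\W+\w+){4}).*", "$1 [...]"). If the string may contain fewer than 5 words and the result should then contain the whole input, replace {4} with {0,4}.
Yes, it works fine with Match (no need for Replace). Please post it as an answer so that I can accept it. Thanks.
Regex will be slower than your approach (at least when the strings are large enough for it to matter), so do not replace your code for performance reasons.
Note that you probably do not want to use RegexOptions.Compiled for a regex that is created on every call. Wiktor's answer explicitly moves it into a (static) field so that it is created and compiled once. Given that your expression changes with the input (the word count), I would probably keep it your way (i.e. inside the method) but drop the option altogether. See for yourself and measure/decide.
Yes, that is correct. I removed it. Thanks a lot.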
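The Regex.Replace suggestion from the comments could be sketched as follows. The class name ReplaceDemo and the method name TruncateToFiveWords are hypothetical, and the " [...]" replacement suffix is assumed to match the truncateSuffix default used elsewhere in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Hypothetical helper: keeps group 1 (the first five words) and
    // replaces everything after it with the truncation suffix.
    public static string TruncateToFiveWords(string s)
    {
        return Regex.Replace(s, @"(?s)^(\W*\w+(?:\W+\w+){4}).*", "$1 [...]");
    }

    public static void Main()
    {
        Console.WriteLine(TruncateToFiveWords("The quick_brown, fox1 jumps over the lazy dog"));
        // prints: The quick_brown, fox1 jumps over [...]
        Console.WriteLine(TruncateToFiveWords("Only three words"));
        // prints: Only three words
    }
}
```

One difference from the Match-based code: here the suffix is appended whenever the pattern matches, even if nothing follows the fifth word, whereas the Match version compares lengths before appending.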

The compiled variant called in the test code above (GetFirstWordsRegExOptimized):
// generated all possible text truncation patterns
private static readonly List<Regex> FirstWordRegexes = new List<Regex>
{
    new Regex(@"^\W*\w+(?:\W+\w+){0}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){1}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){2}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){3}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){4}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){5}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){6}", RegexOptions.Compiled),
    // ...
    // removed for brevity
    // ...
    new Regex(@"^\W*\w+(?:\W+\w+){147}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){148}", RegexOptions.Compiled),
    new Regex(@"^\W*\w+(?:\W+\w+){149}", RegexOptions.Compiled),
};

public static string GetFirstWordsRegExOptimized(string s, int wordCount, string truncateSuffix = " [...]")
{
    // the regex at index wordCount - 1 matches exactly wordCount words
    var regex = FirstWordRegexes[wordCount - 1];
    var match = regex.Match(s);
    if (!match.Success)
        return s;
    var ret = match.Value;
    return ret.Length < s.Length ? ret + truncateSuffix : ret;
}
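Instead of writing out every pattern by hand, the same list could probably be generated at start-up. The sketch below is one way to do that under stated assumptions: the class name CachedFirstWords, the constant MaxWords (set to 150 to match the benchmark's upper bound) and the range check are additions for illustration, not part of the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CachedFirstWords
{
    // Assumption: 150 mirrors the largest word count used in the benchmark.
    private const int MaxWords = 150;

    // Index i holds the pattern with {i} repetitions, i.e. i + 1 words.
    private static readonly List<Regex> Regexes = Enumerable.Range(0, MaxWords)
        .Select(i => new Regex(@"^\W*\w+(?:\W+\w+){" + i + "}", RegexOptions.Compiled))
        .ToList();

    public static string Get(string s, int wordCount, string truncateSuffix = " [...]")
    {
        if (wordCount < 1 || wordCount > MaxWords)
            throw new ArgumentOutOfRangeException(nameof(wordCount));
        Match m = Regexes[wordCount - 1].Match(s);
        // fewer than wordCount words: return the input unchanged
        if (!m.Success)
            return s;
        return m.Value.Length < s.Length ? m.Value + truncateSuffix : m.Value;
    }

    public static void Main()
    {
        Console.WriteLine(Get("The quick_brown, fox1 jumps over the lazy dog", 5));
        // prints: The quick_brown, fox1 jumps over [...]
    }
}
```

Compiling all patterns up front moves the cost to the first use of the class, so whether this pays off depends on how many calls follow; measuring, as the comments suggest, is the safer guide.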