C#: What is a regular expression for parsing out individual sentences?


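I am looking for a good .NET regular expression that I can use to parse individual sentences out of a body of text.

It should be able to parse the following block of text into exactly six sentences: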

Hello world! How are you? I am fine.
This is a difficult sentence because I use I.D.

Newlines should also be accepted. Numbers should not cause  
sentence breaks, like 1.23.

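This is proving a little more challenging than I originally thought.

Any help would be greatly appreciated. I am going to use this to train a system on known bodies of text.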

Try this:

@"(\S.+?[.!?])(?=\s+|$)"

Results:

Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
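
For readers who want to reproduce the results above, here is a minimal sketch of how that pattern could be applied with Regex.Matches. The class and method names (QuickSentenceSplitter, Split) are made up for illustration, and the sample string assumes the only line break sits after "I.D.". Without RegexOptions.Singleline the dot does not match a newline, so a sentence that wraps across lines would not be matched as a whole.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuickSentenceSplitter
{
    // Pattern from the answer above: a non-whitespace character, then as few
    // characters as possible up to a terminator that is followed by whitespace
    // or the end of the input.
    private static readonly Regex SentencePattern =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    // Hypothetical helper: returns every match as one sentence.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match match in SentencePattern.Matches(text))
        {
            sentences.Add(match.Value);
        }
        return sentences;
    }

    public static void Main()
    {
        // Assumed layout of the sample text: one line break after "I.D."
        string text =
            "Hello world! How are you? I am fine. " +
            "This is a difficult sentence because I use I.D.\n" +
            "Newlines should also be accepted. " +
            "Numbers should not cause sentence breaks, like 1.23.";

        foreach (string sentence in Split(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

"I.D." and "1.23." survive only because the period inside them is not followed by whitespace. An abbreviation such as "Mr." followed by a space would still be treated as a sentence end.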

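Of course, for complicated parsing you will need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.

Here is the SharpNLP description and feature list:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

a sentence splitter
a tokenizer
a part-of-speech tagger
a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
a parser
a name finder
a coreference tool
an interface to the WordNet lexical database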

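This is not really possible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "I.D.", "Mr." and so on. For example, how many sentences is "Please show your I.D., Mr. Bond."? I am not familiar with any C# implementations, but I have used NLTK. It probably should not be too hard to re-implement.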
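
To make the point about "difficult" tokens concrete, here is a rough sketch that splits on whitespace after a terminator and then glues a piece back onto the next one when it ends in a known abbreviation. Everything here is an assumption for illustration: the class name, the tiny abbreviation list, and the choice to re-join pieces with a single space. It is not how NLTK or SharpNLP work.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbreviationAwareSplitter
{
    // Hypothetical, deliberately tiny list of titles that never end a sentence.
    // "I.D." is left out on purpose: it can also be the last word of a sentence,
    // which is exactly the ambiguity a fixed list cannot resolve.
    private static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mr.", "Mrs.", "Dr." };

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        if (string.IsNullOrWhiteSpace(text))
        {
            return sentences;
        }

        string pending = null;
        // Naive split: whitespace that directly follows '.', '?' or '!'.
        foreach (string piece in Regex.Split(text.Trim(), @"(?<=[.?!])\s+"))
        {
            // Re-attach a piece that was cut off after an abbreviation.
            string current = pending == null ? piece : pending + " " + piece;

            int lastGap = current.LastIndexOfAny(new[] { ' ', '\t', '\r', '\n' });
            string lastWord = current.Substring(lastGap + 1);

            if (Abbreviations.Contains(lastWord))
            {
                pending = current;   // not a real sentence end, keep collecting
            }
            else
            {
                sentences.Add(current);
                pending = null;
            }
        }

        if (pending != null)
        {
            sentences.Add(pending);  // input ended on an abbreviation
        }
        return sentences;
    }

    public static void Main()
    {
        foreach (string s in Split("Please show your I.D., Mr. Bond. Thank you!"))
        {
            Console.WriteLine(s);
        }
    }
}
```

The list approach only works for tokens that can never end a sentence, so it breaks down as soon as the text contains abbreviations you did not anticipate.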

var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

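// Split on whitespace that follows a sentence terminator.
// .Dump() is LINQPad's output helper; outside LINQPad, print the array with Console.WriteLine instead.
Regex.Split(str, @"(?<=[.?!])\s+").Dump();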

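It is not possible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g.). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are few, if any, offerings in C#, so you may have to create a web service or otherwise link into C#.

Note that if you depend on the exact whitespace in "I.D." you may run into problems later. You will soon find examples that break your regex. For example, most people put spaces after their initials.

There is an excellent summary of open and commercial offerings on WP. We have used several of them. It is worth the effort.

[You use the word "training". This is normally associated with machine learning (which is one approach to NLP, and has been used for sentence splitting). Indeed, the toolkits I mentioned include machine learning. I suspect that is not what you meant, and that you would instead evolve your expression through heuristics. Don't!]
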
I used the suggestions posted here and came up with a regular expression that seems to achieve what I want:
(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)


I used Expresso to come up with:
//  using System.Text.RegularExpressions;
/// <summary>
///  Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
///      \S.+?(?<Terminator>[.!?]|\Z)
///          Anything other than whitespace
///          Any character, one or more repetitions, as few as possible
///          [Terminator]: A named capture group. [[.!?]|\Z]
///              Select from 2 alternatives
///                  Any character in this class: [.!?]
///                  End of string or before new line at end of string
///  Match a suffix but exclude it from the capture. [\s+|\Z]
///      Select from 2 alternatives
///          Whitespace, one or more repetitions
///          End of string or before new line at end of string
///  
///
/// </summary>
public static Regex regex = new Regex(
      "(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
    RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );


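// This is the replacement string. It is left over from the Expresso template:
// the pattern defines no Day, Month or Year groups, so .NET emits ${Day} etc. as literal text.
public static string regexReplace = 
      "$& [${Day}-${Month}-${Year}]";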


//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);

//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);

//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);

//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);

//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);

//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();

//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();
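
As a small usage sketch for the pattern above, the named groups can be read from each match as shown below. The class and method names are invented for this example. RegexOptions.IgnorePatternWhitespace is dropped here because the pattern contains no whitespace, so the option has no effect on it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Same pattern as above, written as a verbatim string.
    private static readonly Regex SentenceRegex = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Hypothetical helper: returns (sentence, terminator) pairs.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in SentenceRegex.Matches(text))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["Sentence"].Value,
                m.Groups["Terminator"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        // The last sentence deliberately has no terminator.
        foreach (var pair in Parse("Hello world! How are you? I am fine"))
        {
            Console.WriteLine("[" + pair.Value + "] " + pair.Key);
        }
    }
}
```

Because the Terminator group also accepts \Z, trailing text without punctuation is still returned as a sentence, with an empty Terminator value.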

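
Most people recommend using SharpNLP, and you probably should, unless you want your QA department to have a bug fest.

But since you are probably under some kind of pressure, here is another attempt that deals with words like "Dr." and "X.". However, it will fail for a sentence that ends with "it".

Hello world! How are you? I am fine. This is a difficult sentence
because I use I.D. Newlines should also be accepted. Numbers should not
cause sentence breaks, like 1.23. H. pylori: see Dr. B. or Mr. FooBar
for a heart evaluation.
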
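Comments:

+1 for pointing us to SharpNLP, which I had not seen before and which could be very useful.

It would be better to use a look-ahead assertion for (?:\s+|$).

Thanks for the info, Gumbo. It is better, but I had to add \S at the front because the whitespace on the left had to be stripped.

Thanks, everyone. This is useful insight. I will try it out over the next few days.

@Luke: It looks like you wanted an explicit line break between "cause" and "sentence" in your sample text, but it did not show. I forced it to show as I
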
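// Code for the attempt that deals with words like "Dr." and "X.".
// Requires System.Linq and System.Text.RegularExpressions; `input` holds the text to split.
// The dot in the look-behind is escaped so that only endings like " Dr." are rejected, not " you?".
var result = new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s([A-Z]|[a-z]){1,3}\.)").Split(input)
    .Where(s => !String.IsNullOrWhiteSpace(s)).ToArray();
foreach (var sentence in result)
{
    Console.WriteLine(sentence);
}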