C#: What is a regular expression for parsing out individual sentences?
Tags: c#, regex
I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text. It should be able to parse the following block of text into exactly six sentences:
Hello world! How are you? I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause
sentence breaks, like 1.23.
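It turns out this is a bit more challenging than I first thought.
Any help would be greatly appreciated. I am going to use this to train a system on known bodies of text.
Try this: @"(\S.+?[.!?])(?=\s+|$)"
Results: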
Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
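A minimal sketch of how the pattern above might be applied with Regex.Matches. The names SentenceMatcher and Extract are illustrative, not from the answer. RegexOptions.Singleline is an assumption added here so that `.` also matches a line break and a sentence wrapped across two lines stays in one match; the answer does not say which options it used.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceMatcher
{
    // Singleline (an assumption of this sketch) lets '.' match '\n'.
    private static readonly Regex SentenceRx =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

    // Hypothetical helper: returns each matched sentence on a single line.
    public static List<string> Extract(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentenceRx.Matches(text))
        {
            // Collapse line breaks and runs of whitespace inside a sentence.
            sentences.Add(Regex.Replace(m.Value, @"\s+", " "));
        }
        return sentences;
    }

    public static void Main()
    {
        string text =
            "Hello world! How are you? I am fine.\n" +
            "This is a difficult sentence because I use I.D.\n" +
            "Newlines should also be accepted. Numbers should not cause\n" +
            "sentence breaks, like 1.23.";

        foreach (string sentence in Extract(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

The lookahead `(?=\s+|$)` is what keeps "I.D." and "1.23" intact: a terminator only ends a sentence when whitespace or the end of the text follows it.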
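Of course, for complex parsing you will need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.
Here is the SharpNLP information and its features:
SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:
- a sentence splitter
- a tokenizer
- a part-of-speech tagger
- a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
- a parser
- a name finder
- a coreference tool
- an interface to the WordNet lexical database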
var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
Regex.Split(str, @"(?<=[.?!])\s+").Dump();
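The `.Dump()` call in the snippet above appears to be LINQPad's output helper, so it will not compile in a plain console project. A minimal console sketch of the same lookbehind split follows; the class name LookbehindSplitter and the extra "Dr." input are illustrative assumptions, added to show where this approach is likely to over-split.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindSplitter
{
    // Split on whitespace that directly follows '.', '?' or '!'.
    // The lookbehind is zero-width, so the punctuation stays with the sentence.
    public static string[] Split(string text)
    {
        return Regex.Split(text, @"(?<=[.?!])\s+");
    }

    public static void Main()
    {
        string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

        foreach (string sentence in Split(str))
        {
            Console.WriteLine(sentence);
        }

        // An abbreviation followed by a space is treated as a boundary too.
        foreach (string piece in Split("See Dr. B today."))
        {
            Console.WriteLine(piece);
        }
    }
}
```

"I.D." and "1.23" survive because no whitespace follows their inner periods, while "Dr. B" is cut in two. That is the kind of case the answers recommending an NLP toolkit have in mind.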
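It is not possible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g.). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately, there are few, if any, offerings in C#. You may therefore have to either create a web service or otherwise link into C#.
Note that relying on the exact whitespace in "I.D." may cause problems in the future. You will soon find examples that break your regex. For example, most people put spaces after their initials.
There is an excellent summary of open and commercial offerings on WP. We have used several of them. It is worth the effort.
[You use the word "train". This is normally associated with machine learning (which is one approach to NLP and has been used for sentence splitting). Indeed, the toolkits I have mentioned include machine learning. I suspect that was not what you meant, but rather that you would evolve your expression through heuristics. Don't!]
I used the suggestions posted here and came up with the regex that seems to achieve what I want to do: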
(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)
I used Expresso to come up with:
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
/// Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
/// \S.+?(?<Terminator>[.!?]|\Z)
/// Anything other than whitespace
/// Any character, one or more repetitions, as few as possible
/// [Terminator]: A named capture group. [[.!?]|\Z]
/// Select from 2 alternatives
/// Any character in this class: [.!?]
/// End of string or before new line at end of string
/// Match a suffix but exclude it from the capture. [\s+|\Z]
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// End of string or before new line at end of string
///
///
/// </summary>
public static Regex regex = new Regex(
"(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
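// This is the replacement string. Note: the pattern above defines no Day,
// Month, or Year groups, so .NET leaves those references as literal text.
public static string regexReplace =
"$& [${Day}-${Month}-${Year}]";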
//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);
//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();
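The commented-out calls above list the general Regex API. As a minimal sketch of how the Sentence and Terminator groups might actually be read, the following uses the same pattern. The class name NamedGroupDemo, the Parse helper, and the sample input are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    private static readonly Regex SentenceRx = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
        RegexOptions.CultureInvariant);

    // Returns (sentence, terminator) pairs. The terminator is empty when the
    // text ends without punctuation and \Z closes the last sentence instead.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in SentenceRx.Matches(text))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["Sentence"].Value,
                m.Groups["Terminator"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        foreach (var pair in Parse("Hello world! How are you? No final period"))
        {
            Console.WriteLine("[{0}] terminator: '{1}'", pair.Key, pair.Value);
        }
    }
}
```

The `\Z` alternative in the Terminator group is what lets a trailing fragment without punctuation still count as a sentence.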
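Most people suggest using SharpNLP, and you probably should, unless you want your QA department to have a bug fest.
But since you are probably under some sort of pressure, here is another attempt that handles words like "Dr." and "X.". It will, however, fail on a sentence that ends with "it".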
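Hello world! How are you? I am fine. This is a difficult sentence
because I use I.D. Newlines should also be accepted. Numbers should not
cause sentence breaks, like 1.23. H. pylori: see Dr. B or Mr. FooBar
for a heart evaluation.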
// Requires: using System.Linq; using System.Text.RegularExpressions;
// The lookbehind rejects a terminator that follows whitespace plus one to three letters (" Dr.", " X.").
var result = new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s([A-Z]|[a-z]){1,3}\.)").Split(input)
    .Where(s => !String.IsNullOrWhiteSpace(s)).ToArray();
foreach (var match in result)
{
    Console.WriteLine(match);
}
+1 for pointing us to SharpNLP, which I had not seen before and which could be very useful.
Better to use a look-ahead assertion for (?:\s+|$).
Thanks for the info, Gumbo. It is better, but I had to add \S at the front because the whitespace on the left has to be removed.
Thank you all. This is useful insight. I will try it out over the next few days.
@Luke: It looks like you wanted a visible break between "cause" and "sentence" in your sample text, but it did not show. I forced it to show as I ...
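To make the last answer's caveat concrete, here is a minimal sketch of the lookbehind-based attempt wrapped in a helper. The class name AbbreviationAwareSplitter and the sample inputs are illustrative. One assumption differs from the posted pattern: the final dot of the lookbehind is escaped so that it only matches a literal period, which seems to be the intent.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbreviationAwareSplitter
{
    // A terminator is rejected when it is a period preceded by whitespace and
    // one to three letters, e.g. " Dr." or " X." (the dot is escaped here by assumption).
    private static readonly Regex SentenceRx =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s[A-Za-z]{1,3}\.)");

    public static string[] Split(string input)
    {
        // Regex.Split also returns captured groups, so every matched sentence
        // appears in the output; the whitespace between matches is filtered out.
        return SentenceRx.Split(input)
            .Where(s => !String.IsNullOrWhiteSpace(s))
            .ToArray();
    }

    public static void Main()
    {
        // "Dr." and "Mr." no longer end a sentence.
        foreach (string s in Split("See Dr. B or Mr. FooBar today. I am fine."))
        {
            Console.WriteLine(s);
        }

        // A short final word such as "it" looks like an abbreviation,
        // so these two sentences come back merged into one.
        foreach (string s in Split("I like it. Do you?"))
        {
            Console.WriteLine(s);
        }
    }
}
```

The heuristic cannot tell a short word from an abbreviation, so any sentence ending in a word of three letters or fewer followed by a period runs into the next one.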