Sentence boundary detection in PHP

php regex nlp

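I want to split a text into sentences using PHP. I am currently using a regular expression, which gives about 95% accuracy, and I would like to improve on that with a better approach. I have seen NLP tools that do this in Perl, Java, and C, but nothing suitable for PHP. Do you know of such a tool?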

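As a low-tech approach, you might consider a series of explode() calls in a loop, using ".", "!" and "?" as your needle. This would be very memory- and processor-intensive (as most text processing is). You would have a set of temporary arrays and one master array in which all the sentences found are numerically indexed in the correct order.

You would also have to check for common exceptions (for example, the "." in titles such as Mr. and Dr.), but since everything is in an array, these kinds of checks should not be too bad.

I am not sure this is any better than regex in terms of speed and scalability, but it is worth a try. How large are the blocks of text you want to split into sentences?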
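The loop described above could look roughly like the following. This is only a minimal sketch, not the answerer's actual code: the function name `split_with_explode` is hypothetical, and it deliberately leaves out the exception checks for titles such as Mr. and Dr.

```php
<?php
// Hypothetical helper: split on ".", "!" and "?" using explode() only.
function split_with_explode(string $text): array
{
    $sentences = [$text];                  // master array, starts as the whole text
    foreach (['.', '!', '?'] as $needle) {
        $next = [];                        // temporary array for this pass
        foreach ($sentences as $chunk) {
            $parts = explode($needle, $chunk);
            $last = count($parts) - 1;
            foreach ($parts as $i => $part) {
                // explode() drops the needle, so re-attach it to every piece but the last.
                $piece = trim($part) . ($i < $last ? $needle : '');
                if ($piece !== '') {
                    $next[] = $piece;
                }
            }
        }
        $sentences = $next;
    }
    return $sentences;
}

print_r(split_with_explode('Hello there. How are you? Fine!'));
```

Because no exception list is applied, an input such as 'Dr. Smith arrived.' is still cut after "Dr.", which is exactly the kind of case the extra checks would have to handle.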

I am using this regular expression:

preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

because I think speed matters more than accuracy.
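To see where a pattern like this loses accuracy, it can help to run it on a short sample. The sample text below is made up for illustration and is not from the original post.

```php
<?php
// Sample text is an assumption, chosen to include an abbreviation and a quotation.
$text = 'It works. Dr. Smith agrees! "Really?" she asked.';

// Split on one whitespace character that follows ., ? or !
// and precedes a capital letter or a quote character.
$sentences = preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

print_r($sentences);
// The abbreviation "Dr." is followed by a capital letter, so it is treated as a sentence end.
```

The split after "Dr." is the kind of false positive that the abbreviation-aware patterns in the answers below try to avoid.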

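Build a list of abbreviations like this:

$skip_array = array('Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'); // etc.

Compile them into a single expression:

$skip = '';
foreach ($skip_array as $abbr) {
    // Each alternative: one whitespace character, the abbreviation, then closing punctuation.
    $skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}

Finally, run this preg_split to break the text into sentences:

$lines = preg_split("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
                     $txt, -1, PREG_SPLIT_NO_EMPTY);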
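Put together, the three steps above might run as follows. This is a sketch under a few assumptions: the sample text is invented, and `preg_quote()` plus `implode()` are swapped in for the manual string concatenation as a defensive variation, not as the answerer's own code.

```php
<?php
$skip_array = ['Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'];

// Build "\sJr[.!?]|\sMr[.!?]|..."; preg_quote() guards against regex metacharacters.
$alternatives = array_map(
    function ($abbr) {
        return '\s' . preg_quote($abbr, '/') . '[.!?]';
    },
    $skip_array
);
$skip = implode('|', $alternatives);

// Sample text is an assumption for illustration.
$txt = 'I met Dr. Smith today. He said hello! Then he left.';

// Split on whitespace after ., ? or ! unless an abbreviation comes right before it.
$lines = preg_split("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/", $txt, -1, PREG_SPLIT_NO_EMPTY);

print_r($lines);
```

Each alternative requires a whitespace character before the abbreviation, so an abbreviation at the very start of the text is not protected by this pattern.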

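An enhanced regex solution

Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works quite well. The version shown here is a slight improvement on other people's work: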
$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.         # or initials ex: "George W. Bush",
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';
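The pattern above is only defined, not applied. A minimal usage sketch follows, assuming it is passed to `preg_split()` with `PREG_SPLIT_NO_EMPTY`; the pattern is repeated in condensed form so the snippet runs on its own, and the sample text is made up.

```php
<?php
// Condensed copy of the commented pattern: same lookbehinds, comments and whitespace removed.
$re = '/(?<=[.!?]|[.!?][\'"])(?<!Mr\.|Mrs\.|Ms\.|Jr\.|Dr\.|Prof\.|Sr\.|\s[A-Z]\.)\s+/i';

// Sample text is an assumption for illustration.
$text = 'Dr. Jones met George W. Bush. "Mrs. Smith was there!" It went well.';

// Split on the whitespace between sentences, discarding empty pieces.
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

Note that the `i` flag also makes `[A-Z]` match lower-case letters, so any single-letter word followed by a period is treated like an initial.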

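@ridgerunner I ported your PHP code to C#.
As a result I get two sentences:

Mr. J. Dujardin régle sa T.V.
A. en esp. uniquement

The correct result should be a single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
And with our test paragraph: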
string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

The result is:
index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!

C# code (the complete listing appears at the end of this page):
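What regex are you using? NLP in PHP sounds like it would be very painful.

"Painful" because it is slower than C? This is the regex I am using: preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

Does this library work for you?

This does not answer my question, since I am looking for a library that does this for me. But can you explain the difference between using explode and preg_split?

@Noam: explode() splits on a simple string match, without any regex. The implication of the answer is that your use case should be simple enough to do without regex, i.e. just explode on each of the common punctuation marks. However, I agree that it does not really answer your question, or even address the problem you are asking about. Your goal is accuracy, which is not what the answer focuses on. (But if you were to take this approach, I would consider strtok() a better solution than explode(), because of the multiple punctuation characters.)

Take explode(" ", "Where are my suspenders?"): the delimiter is " ", a space. PHP explodes your string into pieces whenever it encounters a space, in this case resulting in four words stored in an array under keys [0-3]. The delimiter can be anything: &, #, -, :, and so on. preg_split is a more sophisticated exploder that supports a wealth of metacharacters, switches, functions, and expressions, as the examples above show.

This is still a very straightforward approach. I am looking for something that works through a learning process.

Does your solution handle ellipses?

@giorgio79: Yes, if the "ellipsis" consists of three dots in a row. If you are talking about the single Unicode character that represents an ellipsis, you need to add that Unicode character to the character class for this regex to work.

@Noam - If you specifically want a machine-learning-based solution, please update your question.

With this enhanced regex solution, how can I detect the term "T.V.A."? I tried [T|t]\.[V|v]\.[A|a]\. and "T.V.A.", but it does not work.

@PapyRef - Yes, easily. Look at the regex. See the list of exceptions, e.g. Mr\.|Mrs\.|Ms\.| etc.? Just add your T\.V\.A\. term to this list, separated from the other terms by the OR operator. (Do not forget that you need to escape the dots.)

Can you explain what your actual improvement is?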
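The suggestion to add T\.V\.A\. to the exception list can be tried out with a small sketch like the one below. The helper name `split_sentences` and the idea of building the lookbehind from an array are assumptions made for illustration; the thread itself only edits the pattern by hand.

```php
<?php
// Hypothetical helper: build the splitting regex from a list of exception patterns.
function split_sentences(string $text, array $exceptions): array
{
    // The exceptions are joined with the OR operator inside the negative lookbehind.
    $re = '/(?<=[.!?]|[.!?][\'"])(?<!' . implode('|', $exceptions) . ')\s+/i';
    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Exception list taken from the enhanced regex; dots are already escaped.
$base = ['Mr\.', 'Mrs\.', 'Ms\.', 'Jr\.', 'Dr\.', 'Prof\.', 'Sr\.', '\s[A-Z]\.'];
$text = 'The T.V.A. is a big project! Dr. Jones agrees.';

// Without the extra term the text is cut after "T.V.A.".
print_r(split_sentences($text, $base));

// With T\.V\.A\. added, that position is no longer a split point.
print_r(split_sentences($text, array_merge($base, ['T\.V\.A\.'])));
```

Every further abbreviation that can be followed by whitespace (for example "esp." in the French sentence above) would need its own entry in the same way.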
Full C# listing:

string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
Regex rx = new Regex(@"(\S.+?
                       [.!?]               # Either an end of sentence punct,
                       | [.!?]['""]         # or end of sentence punct and quote.
                       )
                       (?<!                 # Begin negative lookbehind.
                          Mr.                   # Skip either Mr.
                        | Mrs.                  # or Mrs.,
                        | Ms.                   # or Ms.,
                        | Jr.                   # or Jr.,
                        | Dr.                   # or Dr.,
                        | Prof.                 # or Prof.,
                        | Sr.                   # or Sr.,
                        | \s[A-Z].              # or initials ex: George W. Bush,
                        | T\.V\.A\.             # or "T.V.A."
                       )                    # End negative lookbehind.
                       (?=|\s+|$)",
                       RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match match in rx.Matches(sText))
{
    Console.WriteLine("index: {0}  sentence: {1}", match.Index, match.Value);
}