PHP: Split a string into sentences using regex


I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);
However, it fails in a case like this:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";
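What can I do to make it work when Unicode characters are present in the text?

Here is an example for testing.

Bounty info
I'm looking for a complete solution. Before posting an answer, please read the comment thread I had with Wiktor Stribiżew for more relevant information about this issue.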

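Â followed by a space is what the UTF-8 character U+00A0 (non-breaking space) looks like when it is printed to a page or console that is interpreted as Latin-1. So I think what you have between the sentences is a non-breaking space rather than a normal space.

\s can match a non-breaking space as well, but you need to use the /u modifier to tell preg that you are sending it a UTF-8 encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as two characters.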
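As a minimal sketch of that point: the snippet below builds the test string with an explicit U+00A0 (written as "\u{00A0}", which assumes PHP 7.0 or later) and splits it with a deliberately simplified pattern that has no abbreviation handling. It assumes a PHP build where /u also enables Unicode character properties for \s, which is how current releases behave.

```php
<?php
// "\u{00A0}" is a UTF-8 encoded non-breaking space (bytes 0xC2 0xA0).
$text = "Entertainment media properties.\u{00A0} Fairy Tail and Tokyo Ghoul.";

// With /u the subject is read as UTF-8, so \s+ can consume both the
// non-breaking space and the ordinary space that follows it.
$parts = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

The same /u modifier can be added to the longer pattern from the question; only the modifier list at the end of the pattern changes.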

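As expected, any kind of natural language processing is not a trivial task. The reason is that natural languages are evolutionary systems. No single person sat down and decided which ideas were good and which were not. Every rule has 20-40% exceptions. With that said, a single regex that could do what you ask would be impossibly complex. Still, the solution below relies mainly on regexes.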


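  • The idea is to walk over the text step by step.
  • At any given time, the current chunk of text is held in two parts: one is the candidate for the substring before a sentence boundary, the other is the candidate for the substring after it.
  • The first 10 regex pairs detect positions that look like sentence boundaries but actually are not. In that case, before and after are advanced without registering a new sentence.
  • If none of those pairs match, matching is attempted with the last 3 pairs, which may detect a boundary.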

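As for where these regexes come from: I translated them from an existing implementation, which is in turn based on a published paper. If you really want to understand them, there is no choice but to read the paper.

As far as accuracy is concerned, I encourage you to test it with different texts. After some experimenting, I was very pleasantly surprised.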

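In terms of performance, the regexes should perform well: each of them has an \A or \Z anchor, there are almost no repetition quantifiers, and where there are, no backtracking is possible. Still, regexes are regexes. If you plan to use this in a tight loop on large chunks of text, you will have to do some benchmarking.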
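A rough way to do that benchmarking is sketched below. The helper name mean_millis and the $naive splitter are hypothetical, introduced only for illustration; hrtime() assumes PHP 7.3 or later. Any splitter can be passed in as the callable.

```php
<?php
// Hypothetical helper: mean wall-clock time per call, in milliseconds.
function mean_millis(callable $splitter, string $text, int $runs = 100): float {
    $start = hrtime(true);                 // monotonic clock, in nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $splitter($text);
    }
    return (hrtime(true) - $start) / $runs / 1e6;
}

// Stand-in splitter for the measurement; swap in the function you want to time.
$naive = function (string $t): array {
    return preg_split('/(?<=[.!?])\s+/u', $t, -1, PREG_SPLIT_NO_EMPTY);
};

$sample = str_repeat("Mr. Smith went home. It was late! ", 1000);
printf("%.3f ms per run\n", mean_millis($naive, $sample));
```

The numbers depend heavily on the input text and the PHP build, so it is worth measuring with text that resembles your real data.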


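Obligatory disclaimer: excuse my rusty PHP skills. The code below may not be the most idiomatic PHP ever written, but it should still be clear enough to get the point across.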


function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
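One caveat when feeding non-ASCII text to the function above: it advances through the text with substr() and $text[0], which work on bytes, while every pattern uses /u. The sketch below is an illustration, not part of the original answer. It shows what happens when a multi-byte character is cut in half, and one way to walk over whole code points instead.

```php
<?php
$word = "\u{00C9}cole";   // "École"; É is encoded as the two bytes 0xC3 0x89

// substr() counts bytes, so this keeps only the first half of "É".
$firstByte = substr($word, 0, 1);

// A /u match on invalid UTF-8 does not return 0 or 1: it returns false.
$result = preg_match('/\A\p{Lu}/su', $firstByte);
$badUtf8 = ($result === false) && (preg_last_error() === PREG_BAD_UTF8_ERROR);
var_dump($badUtf8);

// Splitting on the empty pattern with /u yields whole code points.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
$upper = preg_match('/\A\p{Lu}/su', $chars[0]);
var_dump($upper);
```

In practice this means a boundary check can be silently skipped whenever $before or $after ends in the middle of a character, so iterating over code points is the safer choice for UTF-8 input.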
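If the whitespace is unreliable, you can instead match each sentence up to a dot that is followed by any amount of whitespace and then an upper-case letter or a quote:

function splitSentences($text) {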
    $re = '/                # Split sentences ending with a dot
        .+?                 # Match everything before, until we find
        (
          $ |               # the end of the string, or
          \.                # a dot
          (?<!              #  Begin negative lookbehind.
            Mr\.            #   Skip either "Mr."
          | Mrs\.           #   or "Mrs.",
                            #   or... (you get the idea).
          )                 #   End negative lookbehind.
          "?                #   Optionally match a quote
          \s*               #   Any number of whitespaces
          (?=               #  Begin positive lookahead
            \p{Lu} |        #   an upper case letter, or
            "               #   a quote
          )
        )
        /iux';

    if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) { 
        return [];
    }

    $sentences = array_map('trim', $matches[0]);

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
$sentences = splitSentences($text);

print_r($sentences);
<?php
    require_once 'classes/autoloader.php'; // Include the autoloader.
    $text   = "Hello there, Mr. Smith. What're you doing today... Smith,"
            . " my friend?\n\nI hope it's good. This last sentence will"
            . " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
    $Sentence   = new Sentence;   // Create a new instance
    $sentences  = $Sentence->split($text); // Split into array of sentences
    $count      = $Sentence->count($text); // Count the number of sentences
?>
<?php
include ('vendor/autoload.php');
 
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
 
class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token,$before,$after) = $d->getDocumentData();
 
        $dotcnt = count(explode('.',$token))-1;
        $lastdot = substr($token,-1)=='.';
 
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
 
        if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
            return 'O';
 
        return 'EOW';
    }
}
$tok = new ClassifierBasedTokenizer(
    new EndOfSentence(),
    new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
        Excellence, then, is not an act, but a habit.";
 
print_r($tok->tokenize($text));
 
// Array
// (
//    [0] => We are what we repeatedly do.
//    [1] => Excellence, then, is not an act, but a habit.
// )
 
$txt = preg_replace('~\p{P}+~', "$0 ", $txt);
<?php


    function splitSentences($text) {
        $re = '/# Split sentences on whitespace between them.
            (?<=                # Begin positive lookbehind.
              [.!?]             # Either an end of sentence punct,
            | [.!?][\'"]        # or end of sentence punct and quote.
            )                   # End positive lookbehind.
            (?<!                # Begin negative lookbehind.
              Mr\.              # Skip either "Mr."
            | Mrs\.             # or "Mrs.",
            | Ms\.              # or "Ms.",
            | Jr\.              # or "Jr.",
            | Dr\.              # or "Dr.",
            | Prof\.            # or "Prof.",
            | Vol\.             # or "Vol.",
            | A\.D\.            # or "A.D.",
            | B\.C\.            # or "B.C.",
            | Sr\.              # or "Sr.",
            | T\.V\.A\.         # or "T.V.A.",
                                # or... (you get the idea).
            )                   # End negative lookbehind.
            \s+                 # Split on whitespace between sentences.
            /uix';

        $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
        return $sentences;
    }

$sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. &Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';

$sentences = splitSentences($sentences);

print_r($sentences);