PHP: parsing a text file that contains quotations


I am trying to split a text file into sentences with PHP. I am currently using the following call:

$results = preg_split('/(?<=[.?!])\s+/', $stringtest, -1, PREG_SPLIT_NO_EMPTY);
It splits the text like this:

[0] In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. 
[1] in Isay 11).
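Another example is:

Dr. Evelyn Hooker, a heterosexual psychologist...
Here the "Dr." part would be a problem, because the period after the title is treated as the end of a sentence.

These texts all come from the MASC NLP corpus.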
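
To make the failure concrete, here is a minimal sketch that runs the same pattern on a shortened sample. The sample string and the naive_split() wrapper are made up for illustration; only the regex comes from the question.

```php
<?php
// Hypothetical wrapper around the question's preg_split call, for illustration only.
function naive_split(string $text): array {
    // Split on any whitespace that directly follows ".", "?" or "!".
    return preg_split('/(?<=[.?!])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Shortened stand-in for the corpus text (an assumption, not the real MASC data).
$sample = 'Plato wrote “love men” (qtd. in Isay 11). Dr. Evelyn Hooker agreed.';
print_r(naive_split($sample));
```

The pattern has no notion of abbreviations, so both "qtd." and "Dr." are treated as sentence ends and the two real sentences come back as four pieces.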

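You can extend the rule-based splitter below to get what you need. Note that $before_regexes contains a list of known abbreviations; add any abbreviations that occur in your corpus. I added qtd there.

Also note that $before_regexes and $after_regexes are paired: the entries at the same index are tested together. I added the '/(?:[”’"\'»])\s*\Z/u' / '/\A(?:\(\p{L})/u' pair and marked it as a non-sentence boundary (the first false in the $is_sentence_boundary array). The pair means: if the text before the current position ends with a closing quote (”, ’, ", ' or »), followed by zero or more whitespace characters, and the text after it starts with ( (matched with \() and any Unicode letter (\p{L}), then do not split there.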
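
As a rough illustration of how one such pair is evaluated at a single position, the sketch below tests the two regexes against a hand-picked "before" and "after" string. The helper name quote_rule_blocks_split() and the sample strings are assumptions made for this example; only the two patterns are taken from the answer.

```php
<?php
// Hypothetical helper: returns true when the added quote/parenthesis pair
// matches, meaning the position must NOT be treated as a sentence boundary.
function quote_rule_blocks_split(string $before, string $after): bool {
    $before_regex = '/(?:[”’"\'»])\s*\Z/u'; // closing quote, then optional whitespace, at the end
    $after_regex  = '/\A(?:\(\p{L})/u';     // "(" followed by a Unicode letter, at the start
    return preg_match($before_regex, $before) === 1
        && preg_match($after_regex, $after) === 1;
}

// Quote followed by a parenthesised citation: the pair matches.
var_dump(quote_rule_blocks_split('physical contact with men” ', '(qtd. in Isay 11).'));
// Quote followed by an ordinary new sentence: the pair does not match.
var_dump(quote_rule_blocks_split('He said “no.” ', 'Then he left.'));
```

Because the pair is listed first, it takes precedence over the later, more general rules that would otherwise split after the closing quote.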

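// Splits $text into sentences by testing paired "before"/"after" regexes at
// every position; the first matching pair decides whether to split there.
function sentence_split($text) {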
    $before_regexes = array('/(?:[”’"\'»])\s*\Z/u',
        '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|qtd)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:\(\p{L})/u',
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, false, true, true, true);
    // Derive the rule count from the array; a hard-coded 13 skips the 14th pair.
    $count = count($before_regexes);

    $sentences = array();
    $sentence = '';
    $before = '';
    // $after is a 10-byte lookahead window; $text holds the not-yet-consumed remainder.
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}
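
To make the control flow easier to follow, here is a reduced sketch of the same idea with only three rule pairs. The name mini_sentence_split() and the three rules are assumptions made for illustration and are far less complete than the list above. The sketch also walks the text one UTF-8 character at a time instead of one byte at a time, which is a deliberate departure from the function above.

```php
<?php
// Reduced sketch of the before/after rule loop (hypothetical, not the full splitter).
function mini_sentence_split(string $text): array {
    // Each rule: [regex on text before the cursor, regex on text after it, split here?]
    $rules = [
        // closing quote, optional whitespace, then "(" + letter: a citation follows
        ['/[”’"\'»]\s*\z/u', '/\A\(\p{L}/u', false],
        // a known abbreviation has just ended
        ['/\b(?:Dr|Mr|Mrs|Ms|Prof|qtd)\.\s\z/u', '/\A/u', false],
        // terminator, optional closing quotes or parenthesis, then one whitespace
        ['/[.!?…][”’"\')]*\s\z/u', '/\A/u', true],
    ];

    // Work on whole UTF-8 characters so the /u patterns never see a partial one.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $sentences = [];
    $sentence = '';
    $before = '';

    foreach ($chars as $pos => $char) {
        $after = implode('', array_slice($chars, $pos, 10)); // 10-character lookahead
        foreach ($rules as [$before_re, $after_re, $is_boundary]) {
            if (preg_match($before_re, $before) && preg_match($after_re, $after)) {
                if ($is_boundary) {
                    $sentences[] = trim($sentence);
                    $sentence = '';
                }
                break; // the first matching rule wins
            }
        }
        $before .= $char;
        $sentence .= $char;
    }

    if (trim($sentence) !== '') {
        $sentences[] = trim($sentence);
    }
    return $sentences;
}

print_r(mini_sentence_split('Plato wrote “love men” (qtd. in Isay 11). Dr. Evelyn Hooker agreed.'));
```

Rule order matters here just as in the full function: the two non-boundary rules sit in front of the general terminator rule, so "qtd. " and "Dr. " are rejected before the general rule gets a chance to split.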


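What is your question?

@JayBlanchard: I think the OP wants to split on punctuation. But since those characters also occur elsewhere, that causes trouble. I don't think regex is a very good tool for this.

@WiktorStribiżew One of the problems is that the answer to the duplicate question does not account for the following case: Dobbens infers that most parents don't raise their children to be gay; “They're not like ‘my child will be gay!’” (Dobbens). It splits that into “...gay!’” and (Dobbens). I can't ask a follow-up question there because I don't have enough reputation points.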