PHP sentence boundaries: including empty lines?


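This is an extension of a question on SO.

I would like to know how to change the regex so that line breaks are preserved.

Sample code that splits some text into sentences, removes one sentence, and puts the text back together: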

<?php
$re = '/# Split sentences on whitespace between them.
    (?<=                # Begin positive lookbehind.
      [.!?]             # Either an end of sentence punct,
    | [.!?][\'"]        # or end of sentence punct and quote.
    )                   # End positive lookbehind.
    (?<!                # Begin negative lookbehind.
      Mr\.              # Skip either "Mr."
    | Mrs\.             # or "Mrs.",
    | Ms\.              # or "Ms.",
    | Jr\.              # or "Jr.",
    | Dr\.              # or "Dr.",
    | Prof\.            # or "Prof.",
    | Sr\.              # or "Sr.",
    | T\.V\.A\.         # or "T.V.A.",
                        # or... (you get the idea).
    )                   # End negative lookbehind.
    [\s+|^$]            # Split on whitespace between sentences/empty lines.
    /ix';

$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!
EOL;

echo "\nBefore: \n" . $text . "\n";

$sentences = preg_split($re, $text, -1);

$sentences[1] = " "; // remove 'sentence one'

// put text back together
$text = implode( $sentences );

echo "\nAfter: \n" . $text . "\n";
?>
I am trying to get the "After" text to be the same as the "Before" text, just with one sentence removed:

After: 
This is paragraph one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

I am hoping this can be achieved with a tweak to the regex, but what am I missing?

The end of the pattern should be replaced with:

  (?:\h+|^$)          # Split on whitespace between sentences\/empty lines.
/mix';
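
As a rough check of this change, here is a minimal sketch that applies the corrected ending to a shortened version of the pattern. The ~ delimiter, the reduced abbreviation list, and the variable names are my own assumptions for brevity, not part of the question. It suggests that the blank line is no longer consumed by the split:

```php
<?php
// Shortened sketch of the pattern with the corrected ending.
// Assumption: "~" is the delimiter and only two abbreviations are kept.
$re = '~
    (?<= [.!?] | [.!?][\'"] )  # after end-of-sentence punctuation, optionally quoted
    (?<! Mr\. | Dr\. )         # but not after these abbreviations
    (?:\h+|^$)                 # horizontal whitespace or an empty line
    ~mix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

$sentences = preg_split($re, $text);

// The newlines are not eaten by the split; they stay inside one element.
var_dump($sentences);
```

With this input the paragraph break ends up inside the element that starts with "Sentence two!", because only horizontal whitespace is consumed as a separator.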

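Note that [\s+|^$] matches a single whitespace character (horizontal or vertical, such as a newline) or a literal +, |, ^ or $ symbol, because it is a character class.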

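What is needed is a group (preferably a non-capturing one here), not a character class. Inside a group (marked with (...)), | works as an alternation operator.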
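
The difference can be seen in a small sketch (the test subjects and variable names are just illustrative). The character class accepts the literal symbols, while the group only accepts what its two alternatives describe:

```php
<?php
// Character class from the question: matches ONE character out of the set.
$class = '/[\s+|^$]/';
// Non-capturing group from the answer: "|" separates two alternatives.
$group = '/(?:\h+|^$)/m';

foreach (['+', '|', '^', '$', ' '] as $subject) {
    printf(
        "%-3s class: %d  group: %d\n",
        var_export($subject, true),
        preg_match($class, $subject),
        preg_match($group, $subject)
    );
}
```

Every subject matches the character class, but only the space matches the group.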

I suggest using \h, which matches only horizontal whitespace (no line breaks), instead of \s.
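
A minimal sketch of the \s versus \h difference, using a simplified lookbehind and a made-up sample string:

```php
<?php
$text = "One. Two.\n\nThree.";

// \s+ also consumes the newlines, so the paragraph break is lost.
$bySpace = preg_split('/(?<=[.!?])\s+/', $text);

// \h+ only consumes spaces and tabs, so the newlines stay in the pieces.
$byHorizontal = preg_split('/(?<=[.!?])\h+/', $text);

var_dump($bySpace, $byHorizontal);
```

The first call yields three pieces with no newlines left. The second yields two pieces, and the blank line survives inside the second one.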

If the /m multiline modifier is not used, ^$ only matches an empty string. That is why I added the /m modifier to the options.
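
To illustrate the effect of /m, here is a small sketch with a made-up two-paragraph string:

```php
<?php
$text = "first\n\nsecond";

// Without /m, ^ and $ anchor only at the ends of the subject
// (aside from a trailing newline), so the empty line is not found.
$withoutM = preg_match('/^$/', $text);

// With /m, ^ and $ also anchor around newlines, so the empty line matches.
$withM = preg_match('/^$/m', $text, $m, PREG_OFFSET_CAPTURE);

var_dump($withoutM, $withM, $m);
```

Here the multiline match is the empty string between the two newlines.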


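Note that I had to escape the / in the last comment, otherwise there would be a warning saying the regex is incorrect. Alternatively, use a different regex delimiter.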
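
A sketch of the delimiter issue, using a simplified pattern and ~ as an arbitrarily chosen alternative delimiter (any character that does not occur in the pattern or its comments should do):

```php
<?php
// With "/" as delimiter, an unescaped "/" inside a comment ends the pattern early.
// The @ operator only hides the resulting warning for this demonstration.
$broken = @preg_split('/(?<=[.!?]) \h+ # sentences/empty lines
    /x', 'One. Two.');

// With "~" as delimiter, the same comment needs no escaping.
$ok = preg_split('~(?<=[.!?]) \h+ # sentences/empty lines
    ~x', 'One. Two.');

var_dump($broken, $ok);
```

The first call fails and returns false because the text after the second / is read as modifiers. The second call splits the string into its two sentences.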

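There seems to be a problem in this regex: [\s+^$] matches whitespace and the literal +, ^ and $ symbols. Replace it with (?:\h+|^$); I think that is all.

I think you can drop the + after \s, or use \s{1} if you really need it to match exactly one, because \s+ eats up the other whitespace. Basically, you need array("stuff", "\n", "stuff"), but I can't be sure without testing, and it is too complex to run in my head.

Thanks. That almost works, but there is one quirk: the preg_split regex combines two of the sentences together. Any ideas? Thanks also for the explanation; I am not familiar with this.

What if you add PREG_SPLIT_DELIM_CAPTURE, use a capturing group, (\h+|^$), and blank out the element at index 2?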
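
A minimal sketch of that last suggestion, under a few assumptions of mine: the pattern is shortened to two abbreviations, ~ is used as the delimiter, and the separator at index 3 is blanked as well, since otherwise a double space would remain where the sentence was removed.

```php
<?php
// Assumption: shortened pattern with "~" as delimiter; only two abbreviations kept.
$re = '~
    (?<= [.!?] | [.!?][\'"] )  # after end-of-sentence punctuation, optionally quoted
    (?<! Mr\. | Dr\. )         # but not after these abbreviations
    (\h+|^$)                   # CAPTURED separator, so preg_split can return it
    ~mix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

// Even indexes hold the text pieces, odd indexes hold the separators between them.
$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

$parts[2] = ''; // remove "This is sentence one."
$parts[3] = ''; // assumption: also drop its trailing separator to avoid a double space

$text = implode('', $parts);
echo $text, "\n";
```

Because the separators are returned and glued back in unchanged, the rebuilt text keeps its original spacing and the blank line. The two sentences around the paragraph break still sit in one element, since the lookbehind requires punctuation immediately before the split point and a newline is never consumed as a separator, but that does not affect the rebuilt text.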