PHP: cutting off a string at disallowed tags in a regular expression

php regex

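I have this nice regular expression that I use with PHP's preg_match_all to match a string consisting of 0 to x sentences before, and 0 to y sentences after, the sentence that contains a specific word: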

'(?:[^\.?!<]*[\.?!]+){0,x}(?:[^\.?!]*)'.$word.'(?:[^\.?!]*)(?:[\.?!]+[^\.?!]*){0,y}'.'(?:[\.?!]+)'
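As written, the expression is only a string fragment: x and y stand for integers, there are no pattern delimiters, and $word is concatenated unescaped. A minimal sketch of how it might be assembled and run follows. The helper name buildSentencePattern, the sample text and the choice of / as delimiter are assumptions for illustration and are not part of the question.

```php
<?php
// Hypothetical helper: wraps the question's expression in delimiters,
// fills in x and y, and escapes the keyword with preg_quote().
function buildSentencePattern(string $word, int $x, int $y): string
{
    $w = preg_quote($word, '/');
    return '/(?:[^\.?!<]*[\.?!]+){0,' . $x . '}(?:[^\.?!]*)' . $w
         . '(?:[^\.?!]*)(?:[\.?!]+[^\.?!]*){0,' . $y . '}(?:[\.?!]+)/';
}

// Assumed sample text: five short sentences, keyword in the third one.
$text = 'One. Two. Three has the keyword here. Four. Five.';
preg_match_all(buildSentencePattern('keyword', 1, 1), $text, $matches);

// The single match keeps the space that follows "One.":
// " Two. Three has the keyword here. Four."
print_r($matches[0]);
```

Escaping the keyword matters as soon as it can contain characters such as . or ?, which would otherwise be read as regex syntax.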

From this complete string:

<div readability="120"><p>Tradition, Expansion, Exile.<br/>Individual paths in Chinese contemporary art </p><p>The contemporary <i>art world</i> craves for novelty: the best reason for Chinese art to be so trendy is also the <strong>worst one</strong>.</p><div>

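Tradition, Expansion, Exile.
Individual paths in Chinese contemporary art
The contemporary art world craves for novelty: the best reason for Chinese art to be so trendy is also the worst one.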

This means that in this example … is an allowed tag, while … and … are not allowed.
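Assuming you define div and span tags as "illegal", the following regular expression matches up to x sentences before and up to y sentences after the sentence containing $word, as long as those sentences do not contain "illegal" tags: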

'(?:(?<=[.!?]|^)(?:(?<!<div|<\/div|<span|<\/span)>|[^>.!?])+[.!?]+){0,x}[^.!?]*'.$word.'[^.!?]*[.!?]+(?:(?:<(?!\/?div|\/?span)|[^<.!?])*[.!?]+){0,y}'
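To see the effect of the tag checks, here is a small sketch that runs this expression against a short sample. The helper name buildTagAwarePattern, the sample text and the / delimiter are assumptions made for the example. Note also that the lookbehind only recognises a bare <div> or <span>: an opening tag with attributes, such as <div readability="120">, would not be caught on the leading side.

```php
<?php
// Hypothetical helper: the answer's expression with delimiters added,
// x and y filled in, and the keyword escaped.
function buildTagAwarePattern(string $word, int $x, int $y): string
{
    $w = preg_quote($word, '/');
    return '/(?:(?<=[.!?]|^)(?:(?<!<div|<\/div|<span|<\/span)>|[^>.!?])+[.!?]+){0,' . $x . '}'
         . '[^.!?]*' . $w . '[^.!?]*[.!?]+'
         . '(?:(?:<(?!\/?div|\/?span)|[^<.!?])*[.!?]+){0,' . $y . '}/';
}

// Assumed sample: the <i> tag is tolerated, the <div> stops the trailing part.
$sample = 'One. Two has <i>the</i> keyword. Three. <div>Four.</div>';
preg_match_all(buildTagAwarePattern('keyword', 1, 2), $sample, $found);

// Up to two trailing sentences are allowed, but only " Three." is taken,
// because the next sentence starts with an illegal <div>:
// "One. Two has <i>the</i> keyword. Three."
print_r($found[0]);
```

The middle sentence itself is matched with [^.!?]*, so an illegal tag inside the keyword sentence is not filtered by this expression.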

Comments:

Please post some sample input and output.

Hi Larry, thanks for looking into this. I have added a random input/output example. Hope you can help me!

If you want to parse HTML with PHP, I suggest … might be useful. There is a discussion of why trying to parse HTML with regular expressions is not a good idea at …

The off-topic answer is to use, for example, QueryPath and select the desired content with qp($html)->find("div p")->html();. Not sure about the length or other matching anomalies, the question is too vague. Matching with a regular expression is certainly feasible, but it takes more effort, so if you are not proficient with regexes, try another approach.

Parsing the small (HTML) fragments I want to use this regex on has worked well so far. As I wrote in the question, the first piece of code works fine. The only problem is that I do not know how to cut off the output string at the tags I do not allow. I do not mean to dismiss your comments, but I think the discussion is not about whether to parse this content with a regex. It is about a final tweak to code that already works, so that it stops looking for further complete sentences when a tag I do not allow appears before or inside a sentence.

Thanks for the complete answer, kopischke! I am already reconsidering, and will look into a procedural way of cutting the text at illegal tags before applying the "sentence regex". I think that would be faster, don't you? Thanks again for your effort, especially the explained regex code, very cool!

The expression from the answer, explained piece by piece:

// 0 TO X LEADING SENTENCES
(?: ---------------------------------// do not create a capture group
  (?<=[.!?]|^) ----------------------// match only after sentence end or start of string
  (?: -------------------------------// do not create a capture group
    (?<!<div|<\/div|<span|<\/span)> -// match ">" only if not preceded by span or div tags
    |[^>.!?] ------------------------// or any other non-punctuation character that is not ">"
  )+ --------------------------------// one or more times
  [.!?]+ ----------------------------// followed by one or more punctuation characters
){0,x} ------------------------------// the whole sentence repeated 0 to x times

// MIDDLE SENTENCE WITH KEYWORD
[^.!?]* -----------------------------// match 0 or more non-punctuation characters
$word -------------------------------// match string value of $word
[^.!?]* -----------------------------// match 0 or more non-punctuation characters
[.!?]+ ------------------------------// followed by one or more punctuation characters

// 0 TO Y TRAILING SENTENCES
(?: ---------------------------------// do not create a capture group
  (?: -------------------------------// do not create a capture group
    <(?!\/?div|\/?span) -------------// match "<" not followed by a "div" or "span" tag
    |[^<.!?] ------------------------// or any non-punctuation character that is not "<"
  )* --------------------------------// zero or more times
  [.!?]+ ----------------------------// followed by one or more punctuation characters
){0,y} ------------------------------// the whole sentence repeated 0 to y times
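
The procedural idea raised in the last comment, cutting the text at illegal tags first and only then applying a plain sentence regex, could look roughly like the sketch below. The function name sentencesAround, its parameters and the simplified sentence pattern are assumptions for illustration. The tag list is passed in by the caller, and div and p are used here only as example values. No claim is made about whether this is faster than the single expression.

```php
<?php
// Hypothetical helper: split the input at "illegal" tags, then run a
// tag-agnostic sentence pattern on each remaining fragment.
function sentencesAround(string $html, string $word, int $x, int $y, array $illegalTags): array
{
    $tags = implode('|', array_map(function ($t) {
        return preg_quote($t, '/');
    }, $illegalTags));

    // Split on opening or closing illegal tags, with or without attributes.
    $fragments = preg_split('/<\/?(?:' . $tags . ')\b[^>]*>/i', $html, -1, PREG_SPLIT_NO_EMPTY);

    // Simplified sentence pattern: no tag checks are needed any more,
    // because no fragment can contain an illegal tag.
    $w = preg_quote($word, '/');
    $pattern = '/(?:(?<=[.!?]|^)[^.!?]+[.!?]+){0,' . $x . '}[^.!?]*' . $w
             . '[^.!?]*[.!?]+(?:[^.!?]*[.!?]+){0,' . $y . '}/';

    $results = [];
    foreach ($fragments as $fragment) {
        if (preg_match_all($pattern, $fragment, $m)) {
            foreach ($m[0] as $match) {
                $results[] = trim($match);
            }
        }
    }
    return $results;
}

// Usage with the sample string from the question (div and p assumed illegal).
$html = '<div readability="120"><p>Tradition, Expansion, Exile.<br/>Individual paths in Chinese contemporary art </p>'
      . '<p>The contemporary <i>art world</i> craves for novelty: the best reason for Chinese art to be so trendy is also the <strong>worst one</strong>.</p><div>';
$result = sentencesAround($html, 'novelty', 1, 1, ['div', 'p']);
print_r($result);
```

Because the split pattern accepts attributes, a tag such as <div readability="120"> is treated as a cut point here, which the lookbehind in the single expression does not handle.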