PHP Regex/DOMDocument: match and replace text that is not inside a link

php regex xpath

I need to find and replace all occurrences of a piece of text, case-insensitively, unless the text is inside an anchor tag. For example:

<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>

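Searching for "match this text" should replace only the first and the last occurrence.

[Edit] According to Gordon's comment, DOMDocument is probably the better tool in this case. I am not at all familiar with the DOMDocument extension, so I would appreciate some basic examples.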


<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>

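This approach works. I am assuming you really want a case-sensitive match, so it matches the lowercase text.

Parsing HTML with regular expressions is a huge challenge; the patterns easily become overly complex and memory-hungry. I think the best approach is: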

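// Step 1: replace every occurrence; step 2: restore the ones inside <a>...</a>.
$html = preg_replace('/match this text/i', 'replacement text', $html);
$html = preg_replace('/(<a\b[^>]*>(?:(?!<\/a>).)*?)replacement text(.*?<\/a>)/is', '$1match this text$2', $html);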

Try this:
$dom = new DOMDocument;
$dom->loadHTML($html_content);

function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
  if (!empty($dom->childNodes)) {
    foreach ($dom->childNodes as $node) {
      if ($node instanceof DOMText && 
          !in_array($node->parentNode->nodeName, $excludeParents)) 
      {
        $node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
      } 
      else
      {
        preg_replace_dom($regex, $replacement, $node, $excludeParents);
      }
    }
  }
}

preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
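
The function above only checks the direct parent of each text node, so text wrapped in another element inside a link is still replaced. A minimal sketch of one way around that, assuming it is acceptable to skip excluded elements entirely, is to never descend into them. The name preg_replace_dom_skip is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical variant of preg_replace_dom(): excluded elements are never entered,
// so text at any depth below them is left untouched.
function preg_replace_dom_skip(string $regex, string $replacement, DOMNode $node, array $excludeTags = []): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Assigning to ->data changes the text without parsing it as markup.
            $child->data = preg_replace($regex, $replacement, $child->data);
        } elseif ($child instanceof DOMElement && !in_array($child->nodeName, $excludeTags, true)) {
            preg_replace_dom_skip($regex, $replacement, $child, $excludeTags);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Match this text</p><p><a href="/">do not <b>match this text</b></a></p>');
preg_replace_dom_skip('/match this text/i', 'IT WORKS', $dom->documentElement, ['a']);
$result = $dom->saveHTML();
echo $result;
```

With this sample, the first paragraph should be rewritten while the text inside the link, including the part wrapped in b, should stay as it was.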

This is a stackless, non-recursive approach that uses a pre-order traversal of the DOM tree:
  libxml_use_internal_errors(TRUE);
  $dom=new DOMDocument('1.0','UTF-8');

  $dom->substituteEntities=FALSE;
  $dom->recover=TRUE;
  $dom->strictErrorChecking=FALSE;

  $dom->loadHTMLFile($file);
  $root=$dom->documentElement;
  $node=$root;
  $flag=FALSE;
  for (;;) {
      if (!$flag) {
          if ($node->nodeType==XML_TEXT_NODE &&
              $node->parentNode->tagName!='a') {
              $node->nodeValue=preg_replace(
                  '/match this text/is',
                  $replacement, $node->nodeValue
              );
          }
          if ($node->firstChild) {
              $node=$node->firstChild;
              continue;
          }
     }
     if ($node->isSameNode($root)) break;
     if ($flag=$node->nextSibling)
          $node=$node->nextSibling;
     else
          $node=$node->parentNode;
 }
 echo $dom->saveHTML();

libxml_use_internal_errors(TRUE); and the three lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.
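
As a rough illustration of those settings, the sketch below loads deliberately broken markup and collects whatever the parser reports instead of letting it surface as PHP warnings. The sample markup is made up for the example, and whether libxml reports anything for it depends on the installed version.

```php
<?php
// Collect parser messages internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->recover = true;              // keep going after parse errors
$dom->strictErrorChecking = false; // do not throw DOMException on errors

// Deliberately sloppy markup (assumed sample): unclosed <p> and <b>.
$dom->loadHTML('<p>Match this text <b>and replace it');

$errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();         // free the internal error buffer

$repaired = $dom->saveHTML();
echo $repaired;
```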
$a='<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>';

echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
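
To make the behaviour of the negative lookahead easier to see, here is a small sketch that wraps the same pattern in a helper (the name replace_outside_links is hypothetical) and runs it on a simple case and on a nested case. It only illustrates the pattern and is not meant as a robust HTML solution.

```php
<?php
// Hypothetical helper; the pattern is the one used in the answer above.
function replace_outside_links(string $html, string $search, string $replacement): string
{
    // (?![^<]*</a>) rejects a match when the next tag after it is a closing </a>.
    $pattern = '~' . preg_quote($search, '~') . '(?![^<]*</a>)~i';
    return preg_replace($pattern, $replacement, $html);
}

$simple = '<p>Match this text</p><p>Don\'t <a href="/">match this text</a></p>';
$simpleResult = replace_outside_links($simple, 'match this text', 'replacement');
echo $simpleResult, "\n";

// Limitation: text wrapped in another element inside the link is still replaced,
// because the next tag after it is </strong>, not </a>.
$nested = '<a href="#">a <strong>match this text</strong> link</a>';
$nestedResult = replace_outside_links($nested, 'match this text', 'replacement');
echo $nestedResult, "\n";
```

The second call shows why the comments in this thread recommend a DOM-based approach when links may contain nested markup.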

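Another option is a third-party HTML parsing library. It is similar to DOMDocument, but I think it is easier to use.
Here is the equivalent alternative:
I just profiled this code against my solution (which prints exactly the same output), and DomDocument is, not surprisingly, much faster (about 4 ms versus about 77 ms).

This is a UTF-8 safe solution that works not only with well-formed documents but also with document fragments.
mb_convert_encoding is needed because loadHtml() seems to have a bug with UTF-8 encoding.
mb_substr trims the body tag from the output, so you get back the original content without any extra markup.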
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

$xpath = new DOMXPath($dom);

foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");

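Comments:

Use DOM here, and decide how you want tags nested inside an anchor to be handled.

Sorry, but this will not work in a lot of cases. Right now you are looking for "match this text" preceded by any single character other than <, a, ., * or >. This code really does not do the job.

There are a dozen scenarios where this does not do the job.

"Huge challenge" is a nice way of putting it :)

A bit of an understatement, eh? :) For some things it is nearly impossible. But this small task is just about manageable.

Nice try, and it is a good approach: "replacing back" does avoid several potential pitfalls of this operation, but I think your solution will still fail on nested tags, tags that span multiple lines, and several other cases. The only correct way is to use something that actually parses the DOM.

@Caleb: agreed. (Although I added the s modifier so that it works for tags that span multiple lines.) I don't think nesting tags inside an anchor tag is all that common. It depends on how robust the OP needs it to be and on where it is used.

Suggested third-party alternatives that actually use DOM instead of string parsing.

@Gordon: I think all of them build the DOM by parsing a string (DOMDocument included). The question is how they do it (for example, whether they clutter the document with unwanted entities or just do their job). Speed is not really an issue, because you only want to process the document when it has been modified. Anyway, thanks for the suggestions, I will look into them further.

@styu All of them are based on DOM, and DOM uses libxml.

@Gordon Maybe there is a bug in libxml, but if they all use DOM then they all have the same problem (they are just different wrappers around the same library). phpQuery and Zend_Dom work fine without a DocType declaration, but neither of them can handle UTF-8 encoding: they turn accented characters into mojibake or entity sequences. If you know a proper solution with DOM, please describe it and I will happily use it.

@styu DOM works fine with UTF-8, but it will not transform anything unless you tell it to. If you need help with DOM, feel free to turn this into a question and I might be inclined to answer it.

+1 for giving DOM a try. It does not take into account text nodes of inline elements inside the <a> element. An XPath of //text()[not(ancestor::a)] will return only the DOMText nodes outside the <a> subtree. Actually, I don't think any of the answers so far takes this into account.

@Gordon, can you provide a test string for this case?

@styu When you iterate over the results of //text(), you get every text node in the document. You only sort out those that have an <a> element as their direct parent, but not those that have an <a> element further up.

@Gordon I have edited my answer according to your suggestion.

@styu I eventually solved this by adding $replaced = str_replace('&', '&amp;', $replaced); which effectively replaces the ampersand with its XML entity.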
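
The two points raised in these comments (excluding text below an <a> at any depth, and the ampersand problem) can be combined in one sketch. It assumes that writing the new string straight into the text node is acceptable, which sidesteps appendXML and therefore the manual & escaping. The function name replace_outside_anchors and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: replace $search in every text node that has no <a> ancestor.
function replace_outside_anchors(string $html, string $search, string $replacement): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);            // wraps the fragment in <html><body>
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // not(ancestor::a) excludes text at any depth below an <a>, not only direct children.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
        // Writing to ->data stores plain text; nothing is parsed as XML here.
        $textNode->data = str_ireplace($search, $replacement, $textNode->data);
    }

    // Serialize only the children of <body> to get the fragment back.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<p>Fish &amp; chips: match this text</p>'
      . '<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong></span></a></p>';
$out = replace_outside_anchors($html, 'match this text', 'REPLACED');
echo $out;
```

Non-ASCII input would still need the encoding handling discussed in the answer; this sketch deliberately uses ASCII-only sample text.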