PHP: XPath does not return text if the p tag is followed by any other tag


I want to get all the text inside the <p> and <h3> tags of the following HTML:

<div class="bodyText">
  <p>
    <div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
  <div class="one">
  <a  href="url" class="img"><img src="url" alt="bar" class="img" width="80" height="60" /></a>
  </div>
  <div class="two">
    <h4 class="preTitle">QIEZ-Lieblinge</h4>
    <h3 class="title"><a  href="url"  title="ABC"  onclick="cmsTracking.trackClickOut({element:this,  channel : 32333770, channelname : 'top_listen',  content : 14832081,  callTemplate : '_htmltagging.Text',  action : 'click',  mouseevent : event});">
        Prominente Gastronomen      </a></h3>
    <span class="postTitle"></span>
    <span class="district"><a href="http://www.qiez.de/berlin/top-listen" title="TOP-LISTEN in Berlin">Berlin</a></span>  </div>
  <div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
    <div class="inlineImage alignLeft">
  <div class="medium">
    <img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
    <span class="caption">
      Schöne Lage: das Restaurant Ø. (c)QIEZ    </span>
  </div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
    "Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
  <div class="clear"></div>
</div>

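But if the <p> tag is followed by any other tag, it does not return the text.

This is mixed content. Depending on what defines the position of the elements, there are a number of factors you could use. In this case, simply selecting all text nodes will probably suffice: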

//div[contains(@class, 'bodyText')]/(p | h3)/text()

If your processor does not allow the union operator inside a path step, you can use your earlier syntax or, a little simpler in my opinion:

//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
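PHP's DOMXPath implements XPath 1.0 only, so a parenthesized union used as a path step, or a comparison against a sequence such as ('p', 'h3'), is not valid syntax there. A minimal sketch of an XPath 1.0 equivalent follows. The sample markup is hypothetical and deliberately well-formed (no div inside a p), and `self::p or self::h3` is just one way to write the element test.

```php
<?php
// Hypothetical, well-formed sample markup (no block elements inside <p>).
$html = '<div class="bodyText">'
      . '<p>First paragraph</p><h3>Heading</h3><p>Second paragraph</p>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// XPath 1.0: test the element name in a predicate instead of using (p | h3) as a step.
$query = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]/text()";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

echo implode("\n", $texts), "\n";
```

With this sample input the loop collects the three strings in document order. On the markup from the question the result can differ, because the div elements placed inside p change how the parser builds the tree.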

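It looks like the p elements contain div elements, which is invalid and messes things up. If you use var_dump inside the loop, you can see that it does pick up the nodes, but nodeValue is empty.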

A quick and dirty fix for the HTML is to wrap the first div contained in the p element in a span:

<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>

If you do not have control over the source HTML, you can make a copy of the HTML and remove the offending div:

$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
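The removal above handles only the first match. As a rough sketch, assuming every div whose class contains articleBox is unwanted, the same idea can be applied to all matches on a parsed copy of the page. The sample markup and variable names here are made up for illustration.

```php
<?php
// Hypothetical sample: a div nested inside a p, as in the question.
$html = '<div class="bodyText">'
      . '<p><div class="articleBox">Teaser box</div>I want this TEXT</p>'
      . '<p>More TEXT</p>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // the invalid nesting triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Copy the matches into an array first, then detach each one from its parent.
$boxes = iterator_to_array($xpath->query("//div[contains(@class, 'articleBox')]"));
foreach ($boxes as $box) {
    $box->parentNode->removeChild($box);
}

// Read what is left from the container rather than from the individual p elements,
// since the parser may have moved the text out of the p.
$container = $xpath->query("//div[contains(@class, 'bodyText')]")->item(0);
$remaining = $container->textContent;

echo $remaining, "\n";
```

Reading textContent from the container avoids relying on where the parser placed the text relative to the p elements. The price is that paragraph boundaries are lost.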

It may be easier to use Simple HTML DOM. Perhaps you could try this:

include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);

foreach ($dom->find("div[class=bodyText]") as $parent) {
    foreach ($parent->children() as $child) {
        if ($child->tag == 'p' || $child->tag == 'h3') {
            // Blank the inner text of every div so that divs nested in a p add nothing to plaintext
            foreach ($dom->find('div') as $e) {
                $e->innertext = '';
            }
            echo $child->plaintext . '<br>';
        }
    }
}

I get this error with both expressions: Warning: DOMXPath::query(): Invalid expression

Your XPath processor may not be standards-compliant, or you may have to call it differently. You may want to edit your question and add the relevant PHP code.

I can't modify the HTML because it comes from an external source.

I added some ideas about what to do if you don't control the source.

A form of the expression that is valid XPath 1.0, which is what DOMXPath supports:

$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
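For an external source that cannot be edited, another option that stays within DOMXPath is to leave the tree untouched and filter text nodes by their ancestors. The sketch below assumes the unwanted blocks can be recognized by a class containing articleBox; the sample markup is hypothetical, and other boxes (such as inlineImage in the question) would need their own conditions.

```php
<?php
// Hypothetical sample: a div nested inside a p, as in the question.
$html = '<div class="bodyText">'
      . '<p><div class="articleBox">Teaser box</div>I want this TEXT</p>'
      . '<p>More TEXT</p>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // the invalid nesting triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Every non-blank text node under the container, except those inside an articleBox div.
$query = "//div[contains(@class, 'bodyText')]//text()"
       . "[normalize-space() and not(ancestor::div[contains(@class, 'articleBox')])]";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

echo implode("\n", $texts), "\n";
```

Because the query uses the descendant axis from the container, it does not matter whether the parser kept the text inside the p or moved it next to it.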