PHP DOMDocument parsing HTML

php xpath

I have the following HTML markup:

<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
    <img class='avatar' src=""/>
    <p style="">
    <img class='pic' src=""/><br>
    <span class='fulltext' style="display:none"></span>
    </p>-<span class='create'></span>
    <a class='permalink' href=""></a>
    </div>
 <div contenteditable="true" class="text"></div>
 <div style="display: block;" class="ui-draggable">
    <img class='avatar' src=""/>
    <p style="">
    <img class='pic' src=""/><br>
    <span class='fulltext' style="display:none"></span>
    </p><span class='create'></span><a class='permalink' href=""></a>
    </div>

There can be more of these parent divs. This is my current code:

$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
    $attr = $book->getAttribute('class');
    //if div contenteditable
    if($attr == 'text') {
        echo '</br>'.$book->nodeValue."</br>";  
    }
    
    else {
        $new = new DOMDocument();
        $newxpath = new DOMXPath($new);
        $avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
        
        $picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
        $fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
        $permalink = $xpath->query("(//a[@class='permalink'])[$q]");
        echo $permalink->item(0)->nodeValue; //date
        echo $permalink->item(0)->getAttribute('href');
        echo $fulltext->item(0)->nodeValue;
        echo $avatar->item(0)->value;
        echo $picture->item(0)->value;
        $q++;
    }
    $i++;
}

But I think there must be a better way to parse this HTML. Is there one? Thanks in advance.

In fact, you are on the right track: HTML should be parsed with DOM objects. A few optimizations are possible, though:

$div = $xpath->query('//div');

is quite greedy; getElementsByTagName should be more appropriate:

$div = $dom->getElementsByTagName('div');
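To make the comparison concrete, here is a minimal sketch that runs both calls on the same document and checks that they select the same nodes in the same order. The markup is a hypothetical, shortened stand-in for the question's HTML, and the sketch says nothing about which call is cheaper.

```php
<?php
// Hypothetical sample markup, shortened from the question.
$html = '<div class="text">a</div>'
      . '<div class="ui-draggable"><div class="text">b</div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$viaXpath = $xpath->query('//div');               // evaluated by the XPath engine
$viaTagName = $dom->getElementsByTagName('div');  // all <div> elements in the document

// Compare the two node lists position by position.
$sameNodes = $viaXpath->length === $viaTagName->length;
for ($i = 0; $sameNodes && $i < $viaXpath->length; $i++) {
    $sameNodes = $viaXpath->item($i)->isSameNode($viaTagName->item($i));
}

echo $sameNodes ? "same nodes\n" : "different nodes\n";
```

Both lists can be iterated with foreach in the same way, so switching between the two calls does not require changing the loop body.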

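Note that DOMXPath::query() accepts a second parameter, the context node (named contextNode in the PHP manual). Also, you don't need a second DOMDocument and DOMXPath inside the loop. Use: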

$avatar = $xpath->query("img[@class='avatar']/@src", $book);

to get the src attribute node relative to the div node. If you follow this advice, your example should be fine.
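One detail of relative queries is worth illustrating: a path that starts with `//` is still evaluated from the document root, even when a context node is passed. The sketch below uses hypothetical markup and file names (a.png, b.png) to show the difference.

```php
<?php
// Hypothetical markup: two blocks, each with its own avatar image.
$html = '<div class="ui-draggable"><img class="avatar" src="a.png"/></div>'
      . '<div class="ui-draggable"><img class="avatar" src="b.png"/></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Context node: the second <div> in the document.
$second = $xpath->query('//div')->item(1);

// Relative path: only <img> children of $second are considered.
$relative = $xpath->query("img[@class='avatar']/@src", $second);

// A leading "//" restarts at the document root despite the context node.
$absolute = $xpath->query("//img[@class='avatar']/@src", $second);

echo $relative->length, ' ', $relative->item(0)->value, "\n"; // 1 b.png
echo $absolute->length, ' ', $absolute->item(0)->value, "\n"; // 2 a.png
```

To search all descendants of the context node rather than only its children, the path can start with `.//` instead.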

Here is a version of your code that follows the advice above:

$dom = new DOMDocument();
$dom->loadHTML($xml);

$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');

foreach($divs as $book) {
    $attr = $book->getAttribute('class');
    if($attr == 'text') {
        // contenteditable text div: print its text content
        echo '<br>'.$book->nodeValue."<br>";
    } else {
        // relative queries, evaluated against the current div ($book)
        $avatar = $xpath->query("img[@class='avatar']/@src", $book);
        $picture = $xpath->query("p/img[@class='pic']/@src", $book);
        $fulltext = $xpath->query("p/span[@class='fulltext']", $book);
        $permalink = $xpath->query("a[@class='permalink']", $book);
        echo $permalink->item(0)->nodeValue; //date
        echo $permalink->item(0)->getAttribute('href');
        echo $fulltext->item(0)->nodeValue;
        echo $avatar->item(0)->value;
        echo $picture->item(0)->value;
    }
}
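The code above assumes that every query matches at least one node. When a block lacks, say, the picture, item(0) returns null and reading a property from it fails. A minimal sketch of one way to guard against that follows; firstValue() is a hypothetical helper, not part of the DOM extension, and the markup is made up for illustration.

```php
<?php
// Hypothetical helper: string value of the first node matched relative
// to $context, or $default when the query fails or matches nothing.
function firstValue(DOMXPath $xpath, string $expr, DOMNode $context, string $default = ''): string
{
    $nodes = $xpath->query($expr, $context);
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    return (string) $nodes->item(0)->nodeValue;
}

// Hypothetical markup: the block has an avatar but no picture.
$html = '<div class="ui-draggable"><img class="avatar" src="a.png"/><p></p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$block = $xpath->query('//div')->item(0);

$avatar = firstValue($xpath, "img[@class='avatar']/@src", $block);
$picture = firstValue($xpath, "p/img[@class='pic']/@src", $block, '(none)');

echo $avatar, ' ', $picture, "\n";
```

Returning a default keeps the loop body short; checking `$nodes->length` inline before each echo would work just as well.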

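$avatar = $avatar is useless.

Yes, I missed that. Thanks.

I have doubts about the use of $q.

@artragis Note that both statements will return the same value.

In any case, getElementsByTagName is buffered, so it is less greedy in memory. Let me find the message on the @internals list and show it to you as evidence.

"Trying to get property of non-object" on echo $picture->.. and echo $fulltext->..

Can you post the complete HTML to pastebin?

Very good. Thank you very much. One last question: what is the difference between nodeValue, value and textValue?

In the example above, you use ->nodeValue on DOMElement nodes and ->value on DOMAttr nodes. I'm not sure about textValue. It should be the value of a DOMText node, or the flattened text representation of a DOMElement's child nodes.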
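As a rough illustration of the properties discussed in that last exchange, here is a small sketch on hypothetical markup. It assumes the property the commenters call textValue is the one PHP's DOM extension documents as textContent; the extension does not define a textValue property.

```php
<?php
// Hypothetical markup for illustration.
$html = '<a class="permalink" href="/post/1">Posted <b>today</b></a>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$link = $dom->getElementsByTagName('a')->item(0);  // DOMElement
$href = $link->getAttributeNode('href');           // DOMAttr
$text = $link->firstChild;                         // DOMText containing "Posted "

echo $link->nodeValue, "\n";    // element: flattened text of all descendants
echo $link->textContent, "\n";  // same flattened text for an element
echo $href->value, "\n";        // attribute: its value
echo $href->nodeValue, "\n";    // attribute: nodeValue gives the same string
echo $text->nodeValue, "\n";    // text node: only its own text
```

In short, value exists only on attribute nodes, while nodeValue and textContent are available on every node and agree with each other for elements, attributes and text nodes.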