
PHP: parsing an HTML file

php html dom


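I need to parse some HTML files. The problem is that their structure is full of tags I don't care about, and I have no specification of what I am going to find.

I want to get all the <p> tags directly under the <body> tag, excluding its other children, and then read the children of each <p> (if any), while still being able to exclude some specific tags.

For example: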

<body>
  <node1>
    <p></p>
  </node1>
  <p></p>
  <p>
    <a>
    </a>
  </p>
</body>

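I would then iterate over each child node and check its nodeName.

How can I exclude, already in the first query, the children of body that are not p? I tried $dom->getElementsByTagName('body/p'); but it doesn't work. Is there a better way to manage and query my HTML file?

Thanks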

Probably not the best way, but it may do the job for someone:

<?php
$html = "<!DOCTYPE html><html><head></head><body>
  <node1>
    <p>1</p>
  </node1>
  <p>2</p>
  <p>3
    <a>
    </a>
  </p>
</body>
</html>
";

$doc = new DOMDocument;
// our HTML might not be perfectly valid so we don't want to be warned about it
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);

// Fetch all p tags once (this matches p at any depth, not only direct children of body)
$paragraphs = $doc->getElementsByTagName('p');
// Iterate over the node list; the loop body is skipped when there are no p tags
foreach ($paragraphs as $p) {
    // dump each node
    var_dump($p);
    echo '<br>----<br>';
}
?>
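
The loop above returns every p in the document, including the one nested inside node1. As a minimal sketch of what the question actually asks for (only the p elements that are direct children of body), an XPath query through DOMXPath may be a closer fit. The sample markup, the variable names and the $excluded list below are assumptions made for illustration and are not part of the original answer.

```php
<?php
// Sample markup modeled on the question; <node1> is deliberately an unknown tag.
$html = '<!DOCTYPE html><html><head></head><body>
  <node1>
    <p>1</p>
  </node1>
  <p>2</p>
  <p>3 <a href="#">link</a> <span>kept</span></p>
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);

// "/html/body/p" selects only <p> elements whose parent is <body>;
// the <p> nested inside <node1> is not matched.
$directParagraphs = [];
foreach ($xpath->query('/html/body/p') as $p) {
    $directParagraphs[] = trim($p->textContent);
}

// Hypothetical exclusion list: child elements of those <p> tags to skip.
$excluded = ['a'];
$keptChildren = [];
foreach ($xpath->query('/html/body/p/*') as $child) {
    if (!in_array($child->nodeName, $excluded, true)) {
        $keptChildren[] = $child->nodeName;
    }
}

print_r($directParagraphs);
print_r($keptChildren);
```

The path /html/body/p relies on loadHTML wrapping the content in html and body elements, which it also does for fragments. To match p at any depth, //p would be the equivalent of getElementsByTagName('p').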

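I use similar code to strip script snippets and their contents from articles that users post, so it may help here: just replace the tags in the array with your own. The result is the HTML without the tags you specified and without their descendants.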

$dom = new DOMDocument();

// $body is your html you load into this variable elsewhere
// Note that there will be a warning if any invalid tags like
// node1 will be loaded, but in most cases it will continue to work
@$dom->loadHTML($body);

$tags_to_remove = array('node1', 'node2');

// Collect and remove the tags with everything they hold.
$remove = array();

foreach ($tags_to_remove as $removal_target) {
    $sentenced = $dom->getElementsByTagName($removal_target);

    foreach ($sentenced as $item) {
        $remove[] = $item;
    }
}

foreach ($remove as $sentenced_item) {
    $sentenced_item->parentNode->removeChild($sentenced_item);
}

// Code below is used to get html without wrapping html>body and doctype
// added by DOMDocument
$body = '';

$body_node = $dom->getElementsByTagName('body')->item(0);

foreach ($body_node->childNodes as $child) {
    $body .= $dom->saveHTML($child);
}

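Note the error suppression operator @ in front of the call that loads the HTML into the DOM object. I used it only to get clean output for this example, and I strongly advise against using it.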
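
One possible alternative to the @ operator is libxml_use_internal_errors(), which the first answer already uses. The sketch below applies it to the tag-removal code and keeps the collected warnings available for logging. The input string and the names $warnings, $clean and $bodyNode are assumptions for illustration; whether libxml reports a warning for an unknown tag such as node1 may depend on the libxml version.

```php
<?php
// Hypothetical input; in the answer above, $body is loaded elsewhere.
$body = '<node1><p>inside</p></node1><p>kept</p><node2>gone</node2>';

$dom = new DOMDocument();

// Collect libxml warnings instead of hiding them with @.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($body);
$warnings = libxml_get_errors();        // array of LibXMLError objects, can be logged
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

// Copy each live node list to an array before removing nodes from the tree.
foreach (['node1', 'node2'] as $tag) {
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize only the children of <body>, without the wrappers added by DOMDocument.
$clean = '';
$bodyNode = $dom->getElementsByTagName('body')->item(0);
if ($bodyNode !== null) {
    foreach ($bodyNode->childNodes as $child) {
        $clean .= $dom->saveHTML($child);
    }
}

echo $clean, PHP_EOL;
echo count($warnings), ' parser warning(s) collected', PHP_EOL;
```

Unlike @, this approach silences only libxml's parser messages, so unrelated errors in the same statement would still surface.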

Would you mind solving it with jQuery or plain JavaScript?

@Bandon Neither, since the question is about PHP.

You should look into that as well.

Could I solve this with XSLT?
