PHP domDocument: identifying a <br />

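I am using domDocument to parse some HTML and want to replace line breaks with \n. However, I am having trouble determining where a break actually occurs within the document.

Given the following HTML snippet, taken from a larger file that I am reading with $dom->loadHTMLFile($pFilename):

I get:

PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:

How can I determine the position of that break within the paragraph and replace it with \n?

Or is there a better way than domDocument to parse HTML that may or may not be well formed?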

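With getElementsByTagName you cannot get the position of an element. Instead, check each element's childNodes and handle text nodes and elements separately.

In the general case this requires recursion, as follows: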

function processElement(DOMNode $element){
    foreach($element->childNodes as $child){
        if($child instanceOf DOMText){
            echo $child->nodeValue,PHP_EOL;
        }elseif($child instanceOf DOMElement){
            switch($child->nodeName){
            case 'br':
                echo 'BREAK: ',PHP_EOL;
                break;
            case 'p':
                echo 'PARAGRAPH: ',PHP_EOL;
                processElement($child);
                echo 'END OF PARAGRAPH;',PHP_EOL;
                break;
            // etc.
            // other cases:
            default:
                processElement($child);
            }
        }
    }
}

$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);
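
To get at the actual goal of the question, the same recursive walk can return a string instead of echoing, emitting "\n" wherever a br element is met. This is a minimal sketch; the function name textWithBreaks is hypothetical and not part of the DOM extension:

```php
<?php
// Hypothetical helper: collects the text under $node and turns each <br> into "\n".
function textWithBreaks(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            // A <br> becomes a newline; any other element is descended into.
            $text .= ($child->nodeName === 'br') ? "\n" : textWithBreaks($child);
        }
    }
    return $text;
}

$doc = new DOMDocument;
$doc->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
foreach ($doc->getElementsByTagName('p') as $p) {
    echo textWithBreaks($p), PHP_EOL;
}
```

Because the newline is appended at the point where the br is encountered among its siblings, its position inside the paragraph is preserved.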

If you don't have to deal with child nodes and the rest, why not simply replace the br tags directly?

$str = '<p>Multiple-line paragraph<br />that has<br>a close tag</p>';
echo preg_replace('/<br\s*\/?>/', "\n", $str);

Output:

<p>Multiple-line paragraph
that has
a close tag</p>
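
The pattern above is case-sensitive and does not match a br that carries attributes. A slightly more tolerant variant is sketched below; the sample string is made up for illustration, and this is still a text substitution, not an HTML parser (an attribute value containing ">" would break it):

```php
<?php
// Illustrative input mixing upper/lower case and an attribute.
$str = '<p>Multiple-line<BR>paragraph<br />that<br class="x">has a close tag</p>';

// \b ends the tag name at "br" (so a tag such as <break> is not matched),
// [^>]* allows attributes or a trailing slash, and the i flag ignores case.
$result = preg_replace('/<br\b[^>]*>/i', "\n", $str);
echo $result;
```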

Alternative (using DOM):

$str = '<p>Multiple-line<BR>paragraph<br />that<BR/>has<br>a close<Br>tag</p>';

$dom = new DomDocument();
$dom->loadHtml($str);

// using xpath here, because it will find every br tag regardless
// of whether it is self-closing or not
$xpath = new DomXpath($dom);
foreach ($xpath->query('//br') as $br) {
  $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}

// output the whole html
echo $dom->saveHtml();

// or just the body child nodes
$output = '';
foreach ($xpath->query('//body/*') as $bodyChild) {
  $output .= $dom->saveXml($bodyChild);
}

echo $output;
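
On the question of input that may not be well formed: loadHTML() reports parse problems as PHP warnings, and libxml_use_internal_errors() can collect them instead. The following is a sketch only, assuming that the plain text of each paragraph is what is wanted in the end; the sloppy sample markup is made up:

```php
<?php
// Made-up, deliberately sloppy input: unclosed <p>, upper-case <BR>.
$html = '<p>first line<BR>second line<p>another paragraph';

$previous = libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

$xpath = new DOMXPath($dom);
// Copy the matches into an array before modifying the tree.
foreach (iterator_to_array($xpath->query('//br')) as $br) {
    $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}

// textContent now contains "\n" wherever a <br> used to be.
$paragraphs = [];
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = $p->textContent;
}
print_r($paragraphs);
```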

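I wrote a simple class that does not use recursion and should be faster and consume less memory, but it follows basically the same idea as @Hrant Khachatrian's answer (walk through all elements and look for the child tags):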

class DomScParser {

    public static function find(DOMNode $parent_node, $tag_name) {
        // A childless node is itself the only candidate. Return it wrapped
        // in an array so that callers can always iterate over the result.
        if (!$parent_node->childNodes->length) {
            return $parent_node->nodeName == $tag_name ? array($parent_node) : array();
        }
        // Initialize path array
        $dom_path = array($parent_node->firstChild);
        // Initialize found nodes array
        $found_dom_arr = array();
        // Iterate while we have elements in path
        while ($dom_path_size = count($dom_path)) {
            // Get last element in path
            $current_node = end($dom_path);
            // If it is an empty element there is nothing to do here,
            // so step back in the path.
            if (!$current_node) {
                array_pop($dom_path);
                continue;
            }

            if ($current_node->firstChild) {
                // The node has children: add its first child to the end of the path.
                // As we are looking for self-contained nodes without children,
                // this node is not what we are looking for, so change the
                // corresponding path element to its sibling.
                $dom_path[] = $current_node->firstChild;
                $dom_path[$dom_path_size - 1] = $current_node->nextSibling;
            } else {
                // Check whether we found a matching node, then change the
                // corresponding path element to its sibling.
                if ($current_node->nodeName == $tag_name) {
                    $found_dom_arr[] = $current_node;
                }
                $dom_path[$dom_path_size - 1] = $current_node->nextSibling;
            }
        }
        return $found_dom_arr;
    }

    public static function replace(DOMNode $parent_node, $search_tag_name, $replace_tag) {
        // Check whether we got a node to replace the found nodes with, or just some text.
        if (!$replace_tag instanceof DOMNode) {
            // Get the DOMDocument object
            if ($parent_node instanceof DOMDocument) {
                $dom = $parent_node;
            } else {
                $dom = $parent_node->ownerDocument;
            }
            $replace_tag = $dom->createTextNode($replace_tag);
        }
        $found_tags = self::find($parent_node, $search_tag_name);
        foreach ($found_tags as $found_tag) {
            $found_tag->parentNode->replaceChild($replace_tag->cloneNode(), $found_tag);
        }
    }

}

$D = new DOMDocument;
$D->loadHTML('<span>test1<br />test2</span>');
DomScParser::replace($D, 'br', "\n");

P.S. It also should not break on deeply nested tags, since it does not use recursion. Example HTML:

$html=str_repeat('<b>',100).'<br />'.str_repeat('</b>',100);

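Well, you can always use a regex. Or, simpler than DOMDocument, there is:

print htmlqp($html)->find("br")->replaceWith("\n")->top("body")->html();

Hmm... that seems like a very plausible idea. I'll try it with some real-world examples and see how it works. Thanks.

Nice, fast, clean, and it lets me handle all the other "standalone" tags as well. I can even handle nested tables efficiently. At the moment my test code looks like spaghetti, but that is my problem (I am still at the experimental stage with the logic and will refactor once I am happy with all the functionality); the approach has great potential.

Although I will probably be dealing with larger user-generated HTML files containing all sorts of tags, I think I would prefer to use $dom->loadHTMLFile($pFilename) rather than loading the file manually, running str_replace, and then calling $dom->loadHTML($htmlString)... especially because of the memory overhead.

@MarkBaker Will you be doing anything else with the tags besides searching for br?

A lot more, most of which I can handle with basic loop and switch tests, but mainly extracting tables to build two-dimensional arrays, and likewise text with styling (inline and CSS)... The core purpose is to create a proper reader for HTML files.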