
PHP: Exploding random, unpredictable tags into an array


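Below is a set of random, unpredictable tags wrapped in a div tag. How can I explode all the child tags in the HTML while preserving their order of occurrence?

Note: for img and iframe tags, only the URL needs to be extracted.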

 <div>
  <p>para-1</p>
  <p>para-2</p>
  <p>
    text-before-image
    <img src="text-image-src"/>
    text-after-image</p>
  <p>
    <iframe src="p-iframe-url"></iframe>
  </p>
  <iframe src="iframe-url"></iframe>
  <h1>header-1</h1>
  <img src="image-url"/>
  <p>
    <img src="p-image-url"/>
  </p>
  content not wrapped within any tags
  <h2>header-2</h2>
  <p>para-3</p>
  <ul>
    <li>list-item-1</li>
    <li>list-item-2</li>
  </ul>
  <span>span-content</span>
 content not wrapped within any tags
</div>
Expected output:

 ["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]

Code I have tried:

        $dom     = new DOMDocument();
        @$dom->loadHTML( $content );
        $tags = $dom->getElementsByTagName( 'p' );
        // Get all the paragraph tags, to iterate its nodes.
        $j = 0;
        foreach ( $tags as $tag ) {
            // get_inner_html() to preserve the node's text & tags
            $con[ $j ] = $this->get_inner_html( $tag );
            // Check if the Node has html content or not
            if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {      
                // Check if the node contains html along with plain text with out any tags
                if ( $tag->nodeValue != '' ) {
                    /*
                     * DOM to get the Image SRC of a node
                     */
                    $domM      = new DOMDocument();
                    /*
                     * Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
                     * Set after initializing DOMDocument();
                     */
                    $con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
                    @$domM->loadHTML( $con[ $j ] );
                    $y = new DOMXPath( $domM );
                    foreach ( $y->query( "//img" ) as $node ) {
                        $con[ $j ] = "img=" . $node->getAttribute( "src" );
                        // Increment the array index to accommodate bad text and image tags.
                        $j++;
                        // Index incremented; fetch the node value to accommodate the text without any tags.
                        $con[ $j ] = $tag->nodeValue;
                    }
                    $domC      = new DOMDocument();
                    @$domC->loadHTML( $con[ $j ] );
                    $z = new DOMXPath( $domC );
                    foreach ( $z->query( "//iframe" ) as $node ) {
                        $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                        // Increment the array index to accommodate bad text and image tags.

                        $j++;
                        // Index incremented; fetch the node value to accommodate the text without any tags.
                        $con[ $j ] = $tag->nodeValue;
                    }
                } else {
                    /*
                     * DOM to get the Image SRC of a node
                     */
                    $domA      = new DOMDocument();
                    @$domA->loadHTML( $con[ $j ] );
                    $x = new DOMXPath( $domA );
                    foreach ( $x->query( "//img" ) as $node ) {
                        $con[ $j ] = "img=" . $node->getAttribute( "src" );
                    }

                    if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
                        foreach ( $x->query( "//iframe" ) as $node ) {
                            $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                        }
                    }
                }
            }
            // Increment the index for the next node
            $j++;
        }

        $this->content = $con;

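The simplest way is to use DOMDocument.

Try a recursive approach! Take an empty array $parts on your class instance and a function extractSomething(DOMNode $source). The function should handle each individual case and then return. If the source is a

  • TextNode: push it onto $parts
  • Element with name=img: push its src onto $parts
  • other special cases
  • Element: call extractSomething(child) for each TextNode or Element child

Now, when the call to extractSomething(yourRootDiv) returns, the list will be in $this->parts.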
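The outline above could be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the class name PartsExtractor, the shortened sample HTML, and the choice to treat iframe like img and to skip whitespace-only text are assumptions made for the example.

```php
<?php
// Hypothetical class illustrating the recursive outline; names are assumptions.
class PartsExtractor
{
    /** @var string[] collected text and URLs, in document order */
    public $parts = array();

    public function extractSomething(DOMNode $source)
    {
        // Text node: keep its trimmed text, skipping whitespace-only nodes.
        if ($source instanceof DOMText) {
            $text = trim($source->nodeValue);
            if ($text !== '') {
                $this->parts[] = $text;
            }
            return;
        }
        if (!($source instanceof DOMElement)) {
            return; // comments and other node types are ignored
        }
        // Special case: img and iframe contribute only their URL.
        if (in_array($source->nodeName, array('img', 'iframe'), true)) {
            if ($source->hasAttribute('src')) {
                $this->parts[] = $source->getAttribute('src');
            }
            return;
        }
        // Generic element: recurse into the children, left to right.
        foreach ($source->childNodes as $child) {
            $this->extractSomething($child);
        }
    }
}

$html = '<div><p>para-1</p><p>before <img src="image-url"/> after</p>'
      . '<iframe src="iframe-url"></iframe>loose text<ul><li>item-1</li></ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$extractor = new PartsExtractor();
$extractor->extractSomething($doc->getElementsByTagName('div')->item(0));
print_r($extractor->parts);
```

Because each element hands its children to the same function from left to right, the entries end up in $parts in the order they appear in the markup.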

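Note that you haven't defined what should happen with mixed content such as sometext1 sometext2 surrounding a referencing element, but the outline above would add 3 elements for it ("sometext1", "ref" and "sometext2").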

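This is only a rough outline of a solution. The point is that you need to process every node in the tree (possibly without really considering its position) and, while traversing the nodes in the right order, build the array by converting each node into the text you want. Recursion is the quickest to code, but you could also try a breadth-first traversal or a walker tool.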

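The bottom line is that you have to accomplish two tasks: traverse the nodes in the right order, and convert each node into the result you want.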
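To make the split between the two tasks concrete, here is a hedged sketch that keeps them in separate functions. The names walkInDocumentOrder and nodeToPart are hypothetical, the sample HTML is shortened, and the generator requires PHP 5.5 or later.

```php
<?php
// Task 1: yield $root and all its descendants in document order (pre-order),
// using an explicit stack instead of recursion.
function walkInDocumentOrder(DOMNode $root)
{
    $stack = array($root);
    while ($stack) {
        $node = array_pop($stack);
        yield $node;
        // Push children last-to-first so the first child is popped first.
        for ($child = $node->lastChild; $child !== null; $child = $child->previousSibling) {
            $stack[] = $child;
        }
    }
}

// Task 2: convert a single node into the wanted string, or null to skip it.
function nodeToPart(DOMNode $node)
{
    if ($node instanceof DOMText) {
        $text = trim($node->nodeValue);
        return $text === '' ? null : $text;
    }
    if ($node instanceof DOMElement
        && in_array($node->nodeName, array('img', 'iframe'), true)
        && $node->hasAttribute('src')) {
        return $node->getAttribute('src');
    }
    return null; // other elements contribute only through their children
}

$doc = new DOMDocument();
$doc->loadHTML('<div><h1>header-1</h1><img src="image-url"/>'
    . '<p>para <iframe src="p-iframe-url"></iframe></p>tail</div>');
$div = $doc->getElementsByTagName('div')->item(0);

$results = array();
foreach (walkInDocumentOrder($div) as $node) {
    $part = nodeToPart($node);
    if ($part !== null) {
        $results[] = $part;
    }
}
print_r($results);
```

The stack gives a pre-order walk, which matches the order of the markup. A queue-based (breadth-first) walk would visit nodes level by level instead, so whether it fits depends on the order you need.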


This is basically the rule of thumb for working with tree/graph structures.

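A quick and easy way to extract the interesting pieces of information from a DOM document is to use XPath. Below is a basic example showing how to get the text content and attribute text from a div element.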

<?php

// Pre-amble, scroll down to interesting stuff...
$html = '<div>
  <p>para-1</p>
  <p>para-2</p>
  <p>
    <iframe src="p-iframe-url"></iframe>
  </p>
  <iframe src="iframe-url"></iframe>
  <h1>header-1</h1>
  <img src="image-url"/>
  <p>
    <img src="p-image-url"/>
  </p>
  content not wrapped within any tags
  <h2>header-2</h2>
  <p>para-3</p>
  <ul>
    <li>list-item-1</li>
    <li>list-item-2</li>
  </ul>
  <span>span-content</span>
 content not wrapped within any tags
</div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);

// Interesting stuff:

// Use XPath to get all text nodes and attribute text
// $texts becomes a DOMNodeList filled with DOMText and DOMAttr objects
$xpath = new DOMXPath($doc);
$texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*', $div);

// You could only include/exclude specific attributes by looking at their name
// e.g. multiple paths: .//@src|.//@href
// or whitelist:        descendant::*/@*[name()="src" or name()="href"]
// or blacklist:        descendant::*/@*[not(name()="ignore")]

// Build an array of the text held by the DOMText and DOMAttr objects
// skipping any boring whitespace
$results = array();
foreach ($texts as $text) {
    $trimmed_text = trim($text->nodeValue);
    if ($trimmed_text !== '') {
        $results[] = $trimmed_text;
    }
}

// Let's see what we have
var_dump($results);

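@jeroen Using the DOM API, I managed to extract only the p tags in the HTML, preserving their order of occurrence. But it fails when tags other than p are present.

Why not just strip_tags()? That will pull out all the contained HTML and leave only the text, in the order the HTML/text appears in the file.

@MarcB If you only use strip_tags(), what happens to the iframe and image paths?

You don't want to get the "innerHTML". You want to look at how to retrieve the value of an "attribute" (e.g. the iframe src) and the "text content" of an element. Those keywords should get you going.

If you show us the relevant part of your code, we will get a better idea of the approach you chose, and if you haven't run into a conceptual error (only), we may even spot the bug.

Successfully extracted only the p tags in the HTML, preserving their order of occurrence. But it fails when tags other than p are present.

Please add some explanation of how the OP can use DOMDocument, perhaps with some examples.

Thanks for your answer. This code works perfectly. However, the array gets built with unwanted DOMAttr objects (style, height, width, alt, rel, and so on). How do I discard them?

Change the XPath expression to match only the attributes you are interested in. That might involve changing the @* part, or adding a predicate (filter) to whitelist or blacklist attribute names. It is entirely up to you.

Thanks a lot @salathe, the XPath expression $texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*[name()="src" or name()="href"]', $div) solved it. Now, suppose I have to blacklist p tags with the class remove, e.g. <p class="remove">please don't show me</p>. How can I filter that? Thanks...

The exceptions should be filtered in the foreach ($texts ...) loop. If $text->nodeType and/or $text->nodeName and/or other properties of $text identify the node as "bad", don't add it to the results.

You could filter out that element by matching elements that are not paragraphs of that class, e.g. descendant-or-self::*[not(self::p[@class="remove"])]. It looks like you should spend some time over the weekend reading up on XPath. A good introduction is ...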
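Putting the two suggestions from the comments together (whitelisting attributes and excluding the "remove" paragraphs) might look like the sketch below. The sample HTML and the variable names $keep and $query are assumptions. The sketch also uses ancestor-or-self::p rather than the self::p from the comment, so that elements nested inside an excluded paragraph are skipped as well.

```php
<?php
$html = '<div>
  <p class="remove">please do not show me</p>
  <p>para-1 <a href="link-url" rel="nofollow">link-text</a></p>
  <img src="image-url" alt="ignored" width="10"/>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);
$xpath = new DOMXPath($doc);

// Elements that are neither a p.remove nor nested inside one.
$keep = 'descendant-or-self::*[not(ancestor-or-self::p[@class="remove"])]';

// Text nodes of the kept elements, plus only their src/href attributes.
$query = $keep . '/text()|' . $keep . '/@*[name()="src" or name()="href"]';

$results = array();
foreach ($xpath->query($query, $div) as $node) {
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $results[] = $value; // skip whitespace-only text nodes
    }
}
print_r($results);
```

Here the alt, width and rel attributes are dropped by the name() predicate, and the text of the paragraph with class remove never reaches the loop.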