PHP DOMDocument class cannot access DOMNode


php html parsing


I can't parse this URL:

$ch = curl_init("http://foldmunka.net");
//curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // not necessary unless the file redirects (like the PHP example we're using here)
$data = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
clearstatcache();
if ($data === false) {
    echo "cURL failed";
    exit;
}
$dom = new DOMDocument();
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
// The search strings on the next three lines did not survive on this page.
$data = preg_replace("//", "", $data);
$data = str_replace("", "", $data);
$data = str_replace("", "", $data);
// The tag names in these two patterns did not survive either; script and style are assumed.
$data = preg_replace('@<script[^>]*?>.*?</script>@si', '', $data);
$data = preg_replace('@<style[^>]*?>.*?</style>@si', '', $data);
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
@$dom->loadHTML($data);
$els = $dom->getElementsByTagName('*');
foreach ($els as $el) {
    print $el->nodeName . "|" . $el->getAttribute('content') . "";
    if ($el->getAttribute('title')) $el->nodeValue = $el->getAttribute('title') . $el->nodeValue;
    if ($el->getAttribute('alt')) $el->nodeValue = $el->getAttribute('alt') . $el->nodeValue;
    print $el->nodeName . "|" . $el->nodeValue . "";
}

I need the alt and title attributes and the plain text, in document order, but I can't access the nodes inside the body tag.
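One thing worth checking in the loop above, offered as a possible cause rather than a confirmed diagnosis: assigning to nodeValue on an element replaces all of its child nodes with a single text node. The sketch below uses a made-up HTML string (not the page from the question) to show child elements disappearing after such an assignment.

```php
<?php

// Hypothetical sample markup, chosen only to illustrate the effect.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div title="tip"><p>one</p><p>two</p></div></body></html>');

$div = $dom->getElementsByTagName('div')->item(0);
$before = $dom->getElementsByTagName('p')->length;

// Same kind of assignment as in the question's loop: the element's
// children are replaced by one text node holding the new string.
$div->nodeValue = $div->getAttribute('title') . ' ' . $div->nodeValue;

$after = $dom->getElementsByTagName('p')->length;

echo "p elements before: $before, after: $after\n";
echo $div->nodeValue, "\n";
```

If this is what happens on the real page, collecting the strings into a separate variable instead of writing them back into the tree avoids the problem.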

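I'm not sure I understand what this script does. The replace operations look like an attempt at sanitizing, but I'm not sure what for if you are only extracting some parts of the markup. Have you tried the Simple HTML DOM parser? It may handle the parsing part more easily. Check out the examples.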


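Here is a solution using DOMDocument and DOMXPath. It is much shorter than the other solution, which uses Simple HTML DOM Parser, and it also runs much faster (about 100 ms versus 2300 ms).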

<?php

function makePlainText($source)
{
    $dom = new DOMDocument();
    $dom->loadHtmlFile($source);

    // use this instead of loadHtmlFile() to load from string:
    //$dom->loadHtml('<html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click</a> Some text.</body></html>');

    $xpath = new DOMXPath($dom);

    $plain = '';

    foreach ($xpath->query('//text()|//a|//img') as $node)
    {
        if ($node->nodeName == '#cdata-section')
            continue;

        if ($node instanceof DOMElement)
        {
            if ($node->hasAttribute('alt'))
                $plain .= $node->getAttribute('alt') . ' ';
            if ($node->hasAttribute('title'))
                $plain .= $node->getAttribute('title') . ' ';
        }
        if ($node instanceof DOMText)
            $plain .= $node->textContent . ' ';
    }

    return $plain;
}

echo makePlainText('http://foldmunka.net');
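As a variation on the function above, here is a minimal sketch that loads from a string, keeps libxml parse warnings quiet, and leaves out text inside script and style elements through the XPath query itself. The function name plainTextFromHtml, the style and script elements added to the sample markup, and the choice to trim whitespace-only text nodes are assumptions made for illustration, not part of the original answer.

```php
<?php

// Hypothetical helper: collects text, alt and title values in document
// order from an HTML string.
function plainTextFromHtml(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Text nodes outside <script>/<style>, plus every <a> and <img> element.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]|//a|//img';

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        if ($node instanceof DOMElement) {
            foreach (['alt', 'title'] as $name) {
                if ($node->hasAttribute($name)) {
                    $parts[] = $node->getAttribute($name);
                }
            }
        } elseif (trim($node->textContent) !== '') {
            // Skip whitespace-only text nodes and trim the rest.
            $parts[] = trim($node->textContent);
        }
    }

    return implode(' ', $parts);
}

$html = '<html><head><title>Hello</title><style>p{}</style></head><body>Hello this site'
      . '<img src="asdasd.jpg" alt="alt attr" title="title attr">'
      . '<a href="open.php" alt="alt attr" title="title attr">click</a> Some text.'
      . '<script>var x = 1;</script></body></html>';

echo plainTextFromHtml($html), "\n";
```

Filtering by ancestor in the query means the loop no longer needs the #cdata-section check, because script and style content is excluded regardless of how the parser stores it.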


Here is a solution using Simple HTML DOM Parser, just for comparison. Its output is similar to that of the DOMDocument solution, but it is more complicated and runs much slower (about 2300 ms versus about 100 ms for DOMDocument), so I don't recommend using it. Updated to handle elements nested inside anchor text. The code is listed after the comments below.
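
I need the plain text and the alt and title attributes. Example: <html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click</a> Some text.</body></html> I need this output: Hello Hello this site alt attr title attr alt attr title attr click Some text.

@turbod Simple HTML DOM can do both. The plain text should be something like $html->find("body", 0)->plaintext. Check the examples on the site to see how to run through a list of all tags and get their alt and title attributes.

Yes, I have read the examples now, but I can't find how to do it. I need the plain text together with the alt and title attributes. print file_get_html('…; prints the plain text, but not the alt and title attributes.

@turbod aahh, now I see: you need them printed in order. I didn't get that. That is going to be more difficult, and I have no solution for it, sorry. I'll leave my answer in place to keep others from making the same mistake. You should edit the example into your question to make it clear.

If anyone knows how to filter out CDATA sections with the XPath query, please comment.

@styu I had a look as you asked, but I don't understand the OP's question. You could try passing the option LIBXML_NOCDATA to your load call. Since the scraped page is valid XHTML, you might also want to use the XML parser instead of the HTML parser.

@Gordon: turbod clarified that he wants to make a plain-text version of the website that includes the title and alt attributes of a and img tags. As far as I can see, load() does not work as expected, but I don't know why (it does not extract the attributes in that case).

@Gordon: is there any benefit to using load() instead of loadHtml()? loadHtml and…
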
<?php
require_once('simple_html_dom.php');
// this global is needed because Simple HTML DOM Parser's callback handler
// doesn't accept extra arguments
static $processed_plain_text = '';

define('LOAD_FROM_URL', 'loadfromurl');
define('LOAD_FROM_STRING', 'loadfromstring');

function callback_cleanNestedAnchorContent($element)
{
    if ($element->tag == 'a')
        $element->innertext = makePlainText($element->innertext, LOAD_FROM_STRING);
}

function callback_buildPlainText($element)
{
    global $processed_plain_text;

    $excluded_tags = array('script', 'style');

    switch ($element->tag)
    {
        case 'text':
            // filter when 'text' is descendant of 'a', because we are
            // processing the anchor tags with the required attributes
            // separately at the 'a' tag,
            // and also filter out other unnecessary tags
            if (($element->parent->tag != 'a') && !in_array($element->parent->tag, $excluded_tags))
                $processed_plain_text .= $element->innertext . ' ';
            break;
        case 'img':
            $processed_plain_text .= $element->alt . ' ';
            $processed_plain_text .= $element->title . ' ';
            break;
        case 'a':
            $processed_plain_text .= $element->alt . ' ';
            $processed_plain_text .= $element->title . ' ';
            $processed_plain_text .= $element->innertext . ' ';
            break;
    }
}

function makePlainText($source, $mode = LOAD_FROM_URL)
{
    global $processed_plain_text;

    if ($mode == LOAD_FROM_URL)
        $html = file_get_html($source);
    elseif ($mode == LOAD_FROM_STRING)
        $html = str_get_html($source);
    else
        return 'Wrong mode defined in makePlainText: ' . $mode;

    $html->set_callback('callback_cleanNestedAnchorContent');

    // processing with the first callback to clean up the anchor tags
    $html = str_get_html($html->save());
    $html->set_callback('callback_buildPlainText');

    // processing with the second callback to build the full plain text with
    // the required attributes of the 'img' and 'a' tags, and excluding the
    // unnecessary ones like script and style tags
    $html->save();

    $return = $processed_plain_text;

    // cleaning the global variable
    $processed_plain_text = '';

    return $return;
}

//$html = '<html><title>Hello</title><body>Hello <span>this</span> site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click <span><strong>HERE</strong></span><img src="image.jpg" title="IMAGE TITLE INSIDE ANCHOR" alt="ALTINACNHOR"></a> Some text.</body></html>';

echo makePlainText('http://foldmunka.net');
//echo makePlainText($html, LOAD_FROM_STRING);
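
To illustrate the LIBXML_NOCDATA suggestion from the comments, here is a minimal sketch using the XML parser on a made-up one-element document. It only shows what the flag does to a CDATA section in XML input; whether it has the same effect on the scraped page, or when loading through the HTML parser, is not verified here.

```php
<?php

// Hypothetical sample document containing a single CDATA section.
$xml = '<root><![CDATA[inside cdata]]></root>';

// Default parsing keeps the CDATA section as its own node type.
$default = new DOMDocument();
$default->loadXML($xml);
$defaultName = $default->documentElement->firstChild->nodeName;

// With LIBXML_NOCDATA the parser merges CDATA sections into text nodes.
$merged = new DOMDocument();
$merged->loadXML($xml, LIBXML_NOCDATA);
$mergedName = $merged->documentElement->firstChild->nodeName;

echo $defaultName, "\n";
echo $mergedName, "\n";
```

With the flag set, a check on nodeName such as the one in the DOMXPath solution would no longer see #cdata-section nodes, so the content would be treated like ordinary text.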