PHP: scraping src attribute values with XPath

php xpath

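I want to use XPath as a scraping tool to grab some images, but XPath cannot find the src attribute, even though I can see these attributes in the site's source code.

Normally I should be able to read the src attribute of the images, but XPath returns nothing.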

$html = pageContent($link."photo");
$path = new \DOMXPath($html);
$route = $path->query("//ul[@class='categoryBox']//li[@class='photoList_item']/a/img");
foreach($route as $val){
    $images[] = trim($val->getAttribute("src"));
}

var_dump($images);

The website is: you can check the path here.

What could be the reason?

In case you need it, here is the pageContent() function:

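function pageContent(string $url): \DOMDocument
{
    // Fetch the page once and cache the raw HTML indefinitely.
    $html = cache()->rememberForever($url, function () use ($url) {

        $opts = array(
            "http" => array(
                "method" => "GET",
                "header" => "Content-Type: text/html; charset=utf-8"
            )
        );

        $context = stream_context_create($opts);
        // @ suppresses warnings; on failure file_get_contents() returns false.
        $result = @file_get_contents($url, false, $context);
        return $result;
    });

    // Collect libxml parse warnings instead of emitting them.
    libxml_use_internal_errors(true);

    $parser = new \DOMDocument();
    // Convert to HTML entities first so loadHTML() does not mangle the Japanese text.
    $parser->loadHTML(mb_convert_encoding($html, "HTML-ENTITIES", "ASCII, JIS, UTF-8, EUC-JP, SJIS"));
    return $parser;
}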

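Apart from the typo fix pointed out in the comments, the image is lazy-loaded, so its src is filled in dynamically and you need to target it another way.

If you inspect the markup closely: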

<a data-lightbox="tile10" href="/uploads/hall_photo/174/1/0/main_0.jpg?1566895565" onClick="ga('send', 'event', 'kanto', 'hall/photo', 'photo/1_0_main0_174', 1, {nonInteraction: true});">
    <img alt="アニヴェルセル 柏 挙式会場" width="750" height="330" class="lazy" data-original="/uploads/hall_photo/174/1/0/main_0_s.jpg?1566895565" />
    <noscript><img alt="アニヴェルセル 柏 挙式会場" width="750" height="330" src="/uploads/hall_photo/174/1/0/main_0_s.jpg?1566895565" /></noscript>
</a>
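
To make the difference concrete, here is a minimal sketch that runs the XPath queries against a trimmed copy of the markup above. The inline HTML string, the ASCII alt text and all variable names are assumptions made for illustration; the real page contains more markup. It shows that the lazy img has no src at all, that data-original holds the URL, and that the img inside noscript can be queried as a second option (this assumes the HTML parser behind DOMDocument exposes the noscript contents as ordinary elements, which is what the answer relies on).

```php
<?php
// Hypothetical, trimmed-down copy of the markup shown above (alt text simplified).
$markup = <<<'HTML'
<ul class="categoryBox">
  <li class="photoList_item">
    <a data-lightbox="tile10" href="/uploads/hall_photo/174/1/0/main_0.jpg?1566895565">
      <img alt="photo" class="lazy" data-original="/uploads/hall_photo/174/1/0/main_0_s.jpg?1566895565" />
      <noscript><img alt="photo" src="/uploads/hall_photo/174/1/0/main_0_s.jpg?1566895565" /></noscript>
    </a>
  </li>
</ul>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$base = "//ul[@class='categoryBox']//li[contains(@class, 'photoList_item')]/a";

// What the question did: read src from the lazy <img>. The attribute is not
// there, and getAttribute() returns an empty string for a missing attribute.
$direct = [];
foreach ($xpath->query($base . "/img") as $img) {
    $direct[] = trim($img->getAttribute('src'));
}

// Option 1: read the URL from the data-original attribute instead.
$lazy = [];
foreach ($xpath->query($base . "/img/@data-original") as $attr) {
    $lazy[] = trim($attr->nodeValue);
}

// Option 2: read src from the <img> inside <noscript>.
$fallback = [];
foreach ($xpath->query($base . "/noscript/img/@src") as $attr) {
    $fallback[] = trim($attr->nodeValue);
}

var_dump($direct, $lazy, $fallback);
```

Here $direct ends up holding only an empty string, which matches the empty result described in the question, while $lazy holds the thumbnail path.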

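Comments:

A lot of sites use JS to load data dynamically after the page has loaded. A simple test: when you read the page in your code, save $html to a local file and then look at that file (not in a browser, as that may trigger the JS!). Check the source for what you expect, and if it is not there, check the JS.

You mean that if the attribute is not in $html, I cannot access the attribute? @nigelrenth

These days the whole world is lacking equality... and so is your code: ul[@class'categoryBox'] is missing an =!

Indeed, that was quite wrong. It is not the source of the problem, though :) Sorry.

Thanks for your answer. I don't know what your output is, but on my side what comes back is empty...

@mr.hello I'll get to it in a minute.

@mr.hello Here it is; just follow it, it should be pretty much the same. Just press "Run" on the fiddle to execute the code.

Yes, it works very well on the fiddle. I think I messed something up on the backend. But thank you for pointing out a different path.

@mr.hello I just modified your function, since there is no cache()->rememberForever() method in the sandbox demo. But I think you get the idea now: simply target the data attribute, or even the src of the img tag inside the noscript tag, and you are fine. Glad this helped.

The code from the answer, targeting the data-original attribute:
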
$html = pageContent('https://hana-yume.net/174/photo/');
$path = new \DOMXPath($html);
$images = [];
$route = $path->query("//ul[@class='categoryBox']//li[contains(@class, 'photoList_item')]/a/img");
foreach($route as $val){
    $images[] = trim($val->getAttribute('data-original'));
}
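
The check suggested in the comments (save the fetched HTML to a file and look for the attribute there, rather than in the browser) can be sketched roughly as follows. The $rawHtml string is a hypothetical stand-in for what file_get_contents() returns for the real page, and the dump file name is likewise an assumption; swap in the real response to use it.

```php
<?php
// Hypothetical stand-in for the fetched page; in the real script this would be
// the string returned by file_get_contents($url).
$rawHtml = '<ul class="categoryBox"><li class="photoList_item"><a href="#">'
         . '<img class="lazy" data-original="/uploads/example_s.jpg"></a></li></ul>';

// Save the raw response so it can be opened in a text editor (not a browser).
$dumpFile = sys_get_temp_dir() . '/page_dump.html';
file_put_contents($dumpFile, $rawHtml);

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($rawHtml);
$xpath = new DOMXPath($doc);

// Count the images that do and do not carry a src attribute in the raw HTML.
$withSrc    = $xpath->query('//img[@src]')->length;
$withoutSrc = $xpath->query('//img[not(@src)]')->length;
printf("dump: %s\nimg with src: %d, without src: %d\n", $dumpFile, $withSrc, $withoutSrc);

// For images without src, list the attributes they do have, which hints at
// where the lazy loader keeps the real URL.
$attrNames = [];
foreach ($xpath->query('//img[not(@src)]') as $img) {
    foreach ($img->attributes as $attr) {
        $attrNames[] = $attr->nodeName;
    }
}
echo implode(', ', $attrNames), "\n";
```

If the images show up in the "without src" count, the attribute is being added by JavaScript in the browser, and the listed attribute names show which one to query instead.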