Scraping an HTML page that redirects to itself with cURL in PHP

php html curl dom web-scraping

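So I want to scrape this page:

It seems my code cannot get the page's full HTML, and it behaves very strangely.

I have tried Simple HTML DOM, but it made no difference.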

    $base = "http://www.asx.com.au/asx/statistics/todayAnns.do";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_URL, $base);
    curl_setopt($curl, CURLOPT_REFERER, $base);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $str = curl_exec($curl);
    curl_close($curl);
    echo htmlspecialchars($str);

This mostly outputs JavaScript, and I cannot get the actual page. My goal is to scrape the table in the middle of the page at that URL.

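cURL can only load the page's markup. The page above loads its data with JavaScript after the initial page load, so you will probably have to use PhantomJS or Splash.

This link may help:

To fetch the data on the server side, we can use PhantomJS from within PHP: execute the page in PhantomJS, then use an exec call to dump the data into PHP.

This article has a step-by-step walkthrough of how to do that.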
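
The exec-based approach described above could look roughly like the sketch below. It is not the linked article's method, only a minimal illustration: the `phantomjs` binary name, a `render.js` script (assumed to open the URL and print the rendered HTML to stdout), and both helper function names are hypothetical.

```php
<?php
// Hypothetical helper: builds the shell command that runs a PhantomJS
// script against a URL. Binary and script paths are assumptions.
function buildPhantomCommand(string $binary, string $script, string $url): string
{
    // escapeshellarg() quotes each argument, so characters such as & and ?
    // in the URL are not interpreted by the shell.
    return escapeshellarg($binary) . ' '
        . escapeshellarg($script) . ' '
        . escapeshellarg($url);
}

// Hypothetical helper: runs the command and returns whatever the script
// printed, or null if the process failed or printed nothing.
function fetchRenderedHtml(string $binary, string $script, string $url): ?string
{
    $output = [];
    $exitCode = 0;
    exec(buildPhantomCommand($binary, $script, $url), $output, $exitCode);

    if ($exitCode !== 0 || $output === []) {
        return null;
    }

    // exec() collects stdout line by line; join the lines back together.
    return implode("\n", $output);
}
```

A call such as `fetchRenderedHtml('phantomjs', 'render.js', $base)` would then stand in for the `curl_exec()` call in the question. This only works if PhantomJS is installed on the server and `exec()` is not disabled in the PHP configuration.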

If you don't need the very latest data, you can use Google's cached version of the page.

<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

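Disclaimer: I used a framework here, and I am the author of that library. You can also fetch the data with plain curl and apply the XPath expressions listed above. I hope this helps :)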
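
As a rough illustration of the "plain curl plus XPath" route mentioned above, the sketch below applies the same XPath expressions with PHP's built-in `DOMDocument` and `DOMXPath` instead of the library. The sample HTML is a made-up stand-in for the fetched page (its structure is only assumed from the XPaths in this answer, and the second column is left empty because its content is unknown), and `extractRows` is a hypothetical helper.

```php
<?php
// Hypothetical helper: applies the table/row/cell XPaths to an HTML string.
function extractRows(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//div[@class="page"]//table//tr') as $tr) {
        $cells = $xpath->query('./td', $tr);
        if ($cells->length < 5) {
            continue; // skip header rows and rows without enough cells
        }
        $link = $xpath->query('./td[5]/a', $tr)->item(0);
        $rows[] = [
            'Published' => trim($cells->item(0)->textContent),
            'Headline'  => trim($cells->item(2)->textContent),
            'Pages'     => trim($cells->item(3)->textContent),
            'Link'      => $link instanceof DOMElement ? $link->getAttribute('href') : null,
        ];
    }
    return $rows;
}

// Made-up sample markup standing in for the string returned by curl_exec().
$sampleHtml = <<<HTML
<div class="page"><table>
<tr><th>header row</th></tr>
<tr><td>10:57 AM</td><td></td><td>Initial Director's Interest Notice</td><td>1</td><td><a href="/asx/statistics/displayAnnouncement.do?display=pdf&amp;idsId=01868833">PDF</a></td></tr>
</table></div>
HTML;

print_r(extractRows($sampleHtml));
```

In real use, `$sampleHtml` would be replaced by the HTML fetched with curl, for example from the cached URL used above. The column positions would need to be checked against the actual markup.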

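I was hoping for a PHP library... I need the server to fetch this information in real time.

@SilverSkin You can use a PhantomJS library in PHP. This article might help:

I don't have permission to install PhantomJS on the server I'm working on. Is there no other option?

Awesome! That's exactly what I was looking for!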

Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )

    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )

    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )

    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )

...