将html文件加载到php脚本的最快方法_Php_Curl_Web Crawler_Domdocument

将html文件加载到php脚本的最快方法

php curl web-crawler

将html文件加载到php脚本的最快方法,php,curl,web-crawler,domdocument,Php,Curl,Web Crawler,Domdocument,我一直在用php编写一个网络爬虫，我使用以下方法：当前方法 function getPublicationData($url){ static $seen = array(); if (isset($seen[$url])) { return; } $seen[$url] = true; $cURL = curl_init($url); curl_setopt($cURL, CURLOPT_RETURNTRANSFER, tru

我一直在用php编写一个网络爬虫，我使用以下方法：

当前方法

function getPublicationData($url){
    static $seen = array();
    if (isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;

    $cURL = curl_init($url);
    curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
    $htmlDoc = curl_exec($cURL);

    $dom= new DOMDocument('1.0');
    libxml_use_internal_errors(true);
    $dom->loadHTML($htmlDoc);
    $dom_xpath = new DOMXPath($dom);

    $strongElements = $dom_xpath->query("//strong[@class='publication-meta-type']");
    foreach( $strongElements as $strongElement){
        echo $strongElement->nodeValue;
    }
}

问题是php有30秒的时间限制，我需要访问相当多的页面（我的主机不允许我更改时间限制）

如果能够从页面或类似的地方只获取几个特定节点，那就太好了

有人能给我一个解决方案吗？

几乎可以肯定，耗时的部分是HTTP请求。你没办法加快速度

解决方案？找到新主机的时间到了。

最耗时的部分几乎肯定是HTTP请求。你没办法加快速度

解决方案？是时候换一台新主机了。

用html异步调用数据库

第一部分

static $seen = array();
if (isset($seen[$url])) {
    return;
}
$seen[$url] = true;

$cURL = curl_init($url);
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
$htmlDoc = curl_exec($cURL);
//save in file, database, whatever

第二部分

static $seen = array();
if (isset($seen[$url])) {
    return;
}
$seen[$url] = true;

$cURL = curl_init($url);
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
$htmlDoc = curl_exec($cURL);
//save in file, database, whatever

创建cron作业，或以其他方式调用函数来解析数据，并保存在数据库中：

$htmlDoc = //get data from whatever you decided to save
$dom= new DOMDocument('1.0');
libxml_use_internal_errors(true);
$dom->loadHTML($htmlDoc);
$dom_xpath = new DOMXPath($dom);

$strongElements = $dom_xpath->query("//strong[@class='publication-meta-type']");
foreach( $strongElements as $strongElement){
    echo $strongElement->nodeValue;
....

使用html对数据库进行异步调用

第一部分

static $seen = array();
if (isset($seen[$url])) {
    return;
}
$seen[$url] = true;

$cURL = curl_init($url);
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
$htmlDoc = curl_exec($cURL);
//save in file, database, whatever

第二部分

static $seen = array();
if (isset($seen[$url])) {
    return;
}
$seen[$url] = true;

$cURL = curl_init($url);
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
$htmlDoc = curl_exec($cURL);
//save in file, database, whatever

创建cron作业，或以其他方式调用函数来解析数据，并保存在数据库中：

$htmlDoc = //get data from whatever you decided to save
$dom= new DOMDocument('1.0');
libxml_use_internal_errors(true);
$dom->loadHTML($htmlDoc);
$dom_xpath = new DOMXPath($dom);

$strongElements = $dom_xpath->query("//strong[@class='publication-meta-type']");
foreach( $strongElements as $strongElement){
    echo $strongElement->nodeValue;
....

@Martin我怀疑OP是在共享主机上。限制PHP执行时间的主机（“我的主机不允许我更改时间限制”）不太可能允许您编辑PHP.ini以删除这些限制。@ceejayoz我刚刚读到了这些内容，因此删除了我的评论：-这个方法运行吗？您是从浏览器还是从某个cron脚本运行它？如何调节您的脚本以跟踪自身，因为页面加载将处于循环结构中？您可以对循环计时，并在25秒后告诉您的脚本标记它的位置，然后调用重新加载

标题来重新加载页面并防止超时，这会让人感觉舒服吗？@moscar这段代码来自wordpress插件。@Martin我怀疑OP是在共享主机上。限制PHP执行时间的主机（“我的主机不允许我更改时间限制”）不太可能允许您编辑PHP.ini以删除这些限制。@ceejayoz我刚刚读到了这些内容，因此删除了我的评论：-这个方法运行吗？您是从浏览器还是从某个cron脚本运行它？如何调节您的脚本以跟踪自身，因为页面加载将处于循环结构中？您可以对循环计时，并在25秒后告诉您的脚本标记它的位置，然后调用重新加载标题来重新加载页面并防止超时，这会让人感觉舒服吗？@moscar这段代码来自wordpress插件。