PHP: Can't parse the titles of some links using function

php web-scraping

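I've written a script to parse the title of each page after making use of the links collected from this page. To be clearer: the script below is supposed to parse all the links on the landing page, then reuse those links to go one layer deep and parse the titles of the posts there.

As this is my first attempt at writing anything in php, I can't figure out where I'm going wrong.

This is my attempt so far: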

<?php
include("simple_html_dom.php");
$baseurl = "https://stackoverflow.com";
function get_links($baseurl)
{
    $weburl = "https://stackoverflow.com/questions/tagged/web-scraping";
    $html   = file_get_html($weburl);
    $processed_links = array();
    foreach ($html->find(".summary h3 a") as $a) {
            $links           = $a->href . '<br>';
            $processed_links[] = $baseurl . $links;

        }
        return implode("\n",$processed_links);
}
function reuse_links($processed_links){
    $ihtml = file_get_html($processed_links);
    foreach ($ihtml -> find("h1 a") as $item) {
        echo $item->innertext;
    }
}
$pro_links = get_links($baseurl);
reuse_links($pro_links);
?>

When I execute the script, it produces the following error:

Warning: file_get_contents(https://stackoverflow.com/questions/52347029/getting-all-the-image-urls-from-a-given-instagram-user<br> https://stackoverflow.com/questions/52346719/unable-to-print-links-in-another-function<br> https://stackoverflow.com/questions/52346308/bypassing-technical-limitations-of-instagram-bulk-scraping<br> https://stackoverflow.com/questions/52346159/pulling-the-href-from-a-link-when-web-scraping-using-python<br> https://stackoverflow.com/questions/52346062/in-url-is-indicated-as-query-or-parameter-in-an-attempt-to-scrap-data-using<br> https://stackoverflow.com/questions/52345850/not-able-to-print-link-from-beautifulsoup-for-web-scrapping<br> https://stackoverflow.com/questions/52344564/web-scraping-data-that-was-shown-previously<br> https://stackoverflow.com/questions/52344305/trying-to-encode-decode-locations-when-scraping-a-website<br> https://stackoverflow.com/questions/52343297/cant-parse-the-titles-of-some-links-using-function<br> https: in C:\xampp\htdocs\differenttuts\simple_html_dom.php on line 75

Fatal error: Uncaught Error: Call to a member function find() on boolean in C:\xampp\htdocs\differenttuts\testfile.php:18 Stack trace: #0 C:\xampp\htdocs\differenttuts\testfile.php(23): reuse_links('https://stackov...') #1 {main} thrown in C:\xampp\htdocs\differenttuts\testfile.php on line 18

Once again: I want my script to take the links from the landing page and parse the titles from their target pages.

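I'm not very familiar with simple_html_dom, but I'll try to answer the question. This library uses file_get_contents to perform HTTP requests, but in PHP 7 file_get_contents doesn't accept a negative offset (which is this library's default) when retrieving network resources.

If you're using PHP 7, you'll have to set the offset to 0: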

$html = file_get_html($url, false, null, 0);
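Separately from the offset, the fatal error in the question comes from calling find() on a load that returned false. Below is a minimal sketch of guarding against a failed load. It assumes a hypothetical helper, fetch_html, built on PHP's built-in file_get_contents rather than on simple_html_dom, and it uses a missing local path only so the sketch runs offline.

```php
<?php
// Hypothetical helper (not part of the original answer or of simple_html_dom):
// returns the page body, or null when loading fails, so the caller never
// ends up calling a method on boolean false.
function fetch_html(string $url): ?string
{
    // file_get_contents() returns false on failure and raises a warning;
    // @ suppresses the warning so the failure can be handled explicitly.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Offline demo: a path that does not exist stands in for a URL that fails.
$links = [__DIR__ . "/no-such-page.html"];
foreach ($links as $link) {
    $body = fetch_html($link);
    if ($body === null) {
        echo "Skipping (could not load): $link\n";
        continue;
    }
    echo strlen($body), " bytes loaded from $link\n";
}
```

The same kind of check (comparing the result with false before calling find()) could be applied to the value returned by file_get_html.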

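In your get_links function you join the links into a single string. I think it would be better to return an array, since the next function needs those links for new HTTP requests. For the same reason, you shouldn't add break tags to the links; you can add the breaks when you print them.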

function get_links($url)
{
    $processed_links  = array();
    $base_url = implode("/", array_slice(explode("/", $url), 0, 3));
    $html = file_get_html($url, false, null, 0);
    foreach ($html->find(".summary h3 a") as $a) {
        $link = $base_url . $a->href;
        $processed_links[] = $link;
        echo $link . "<br>\n";
    }
    return $processed_links ;
}

function reuse_links($processed_links)
{
    foreach ($processed_links as $link) {
        $ihtml = file_get_html($link, false, null, 0);
        foreach ($ihtml -> find("h1 a") as $item) {
            echo $item->innertext . "<br>\n";
        }
    }
}

$url = "https://stackoverflow.com/questions/tagged/web-scraping";
$pro_links = get_links($url);
reuse_links($pro_links);

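I think it makes more sense to pass the main URL as the parameter to get_links, since we can derive the base URL from it. I used array functions to build the base URL, but you could use a dedicated URL-parsing function such as parse_url, which is the more appropriate tool.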
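As one alternative to the array functions, PHP's built-in parse_url can split the URL into its components. A minimal sketch follows; the helper name base_url_of is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: derive "scheme://host[:port]" from an absolute URL.
function base_url_of(string $url): string
{
    $parts = parse_url($url);
    // parse_url() returns false for seriously malformed URLs, and relative
    // URLs carry no scheme or host, so both cases are rejected here.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    $base = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    return $base;
}

echo base_url_of("https://stackoverflow.com/questions/tagged/web-scraping"), "\n";
// prints: https://stackoverflow.com
```

Unlike slicing the exploded string, this rejects input that lacks a scheme or host instead of silently producing a wrong prefix.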

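Stack Overflow rejected your request, so try crawling the site with curl (not with the file_get_contents function).

In the foreach you wrote $ihtml>find instead of $ihtml->find.

Sorry, Mohammad, that was a typo. I'll correct it. Thanks.

Don't scrape SO. Use the RSS feed.

You could also use the API, which is well documented.

Welcome to PHP!

I hope you'll take a look when you have time, @t.m.adam. It seems you're always the one I need to turn to. Thanks.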
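The feed suggestion above could be sketched roughly as follows, assuming the feed is Atom-formatted XML that has already been downloaded. The helper feed_titles and the sample document are invented for illustration, and no real feed URL is assumed.

```php
<?php
// Hypothetical helper: collect the entry titles from an Atom feed document.
function feed_titles(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return []; // not well-formed XML
    }
    $atom = 'http://www.w3.org/2005/Atom'; // Atom namespace URI
    $titles = [];
    foreach ($feed->children($atom)->entry as $entry) {
        $titles[] = trim((string) $entry->children($atom)->title);
    }
    return $titles;
}

// Invented sample data standing in for a downloaded feed.
$sample = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>sample feed</title>
  <entry><title type="text">First question title</title></entry>
  <entry><title type="text">Second question title</title></entry>
</feed>
XML;

foreach (feed_titles($sample) as $title) {
    echo $title, "\n";
}
```

Reading titles from a feed avoids one HTTP request per question, which is the main cost of the scraping approach in the question.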