PHP - Gather data from all Trustpilot reviews

php curl web-scraping


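The code below fetches all 25 pages of reviews for a site such as example.com. What I then want to do is put all the results into a JSON array or something similar.

To try to retrieve all of the names, I also attempted the second, shorter snippet further down (the one using preg_replace and explode).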

<?php
for ($x = 0; $x <= 25; $x++) {

    $ch = curl_init("https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
    //curl_setopt($ch, CURLOPT_POST, true);
    //curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    //curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
    $trustpilot = curl_exec($ch);

    // Check if any error occurred
    if (curl_errno($ch)) {
        die('Fatal Error Occurred');
    }

}
?>

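This is clearly a lot more difficult than I anticipated. Does anyone know how I could get all the reviews into JSON or something similar, for however many pages I choose? In this case, for example, I chose 25 pages' worth of reviews.

Thanks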

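Parse them using DOMDocument and DOMXPath. Also, you create a new curl handle for each page but never close them. That is a resource/memory leak in your code, and it is also a waste of CPU: creating a handle costs CPU time, and you can reuse the same curl handle over and over instead of creating a new one for each page. And a protip: this HTML compresses rather well, so you should use CURLOPT_ENCODING to download the pages compressed, e.g.: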
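As a minimal sketch of the DOMDocument/DOMXPath approach, the snippet below parses a small hard-coded HTML fragment rather than a live page. The markup, the class names review-card and review-title, and the helper name extractTitles are assumptions made for illustration; Trustpilot's real markup may differ and can change over time.

```php
<?php
declare(strict_types=1);

/**
 * Returns the text of every review title found in $html, keyed by card id.
 * The class names used in the XPath queries are hypothetical.
 */
function extractTitles(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query("//article[@class='review-card']") as $card) {
        // The second argument makes the query relative to the current card.
        $node = $xpath->query(".//h2[@class='review-title']", $card)->item(0);
        if ($node !== null) {
            $titles[$card->getAttribute('id')] = trim($node->textContent);
        }
    }
    return $titles;
}

// Made-up sample markup standing in for a downloaded page.
$sample = '<html><body>'
    . '<article class="review-card" id="r1"><h2 class="review-title"> Great </h2></article>'
    . '<article class="review-card" id="r2"><h2 class="review-title">Not bad</h2></article>'
    . '</body></html>';

print_r(extractTitles($sample));
```

Checking the result of item(0) for null before reading textContent keeps the loop from failing on a card that lacks the expected element.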

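The JSON output at the end contains a single entry because there is only 1 review for the URL you listed. 4d6bbf8a0000640002080bc2 is the website's internal id for that review, probably its SQL db id.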

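For scraping: 1) choose a useful "landing" page that contains some kind of pager with the information you need, so there is no need to predefine a raw value for the max page; 2) put some example response in the second part of your code, the part that parses the data, because the first part (the curl link) doesn't provide an example of the scraped data; 3) as you use PHP and you mention JSON, check PHP's JSON functions.

All 25 pages? Hmm, I can only see 1 review on page 1.

I get Warning: Unsupported declare 'strict_types' in D:\Servers\WebServer\htdocs\api\trustpilot.php on line 2, and Parse error: syntax error, unexpected ':', expecting '{'

@Jigger this code was written for PHP 7. If you try to run it on PHP 5, you will get several errors, including that one.

Do you have any reading material about DOM and loading HTML? I've never heard of this approach before and I'd like to learn how it works.

@Jigger I can't really think of any. From a quick look, this seems relevant and detailed.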
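The first comment's suggestion, reading the page count from the pager instead of hard-coding 25, could be sketched roughly as follows. The pager markup, the page query parameter in the links, and the helper name detectLastPage are all assumptions for illustration; the real site's pagination markup would need to be inspected first.

```php
<?php
declare(strict_types=1);

/**
 * Returns the highest page number linked from the pager in $html,
 * or 1 when no pager link is found.
 * Assumes (hypothetically) that pager links carry a "page=N" query parameter.
 */
function detectLastPage(string $html): int
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $last = 1;
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//a[contains(@href, 'page=')]") as $link) {
        // Pull the query string out of the href and split it into parameters.
        $query = (string) parse_url($link->getAttribute('href'), PHP_URL_QUERY);
        parse_str($query, $params);
        $page = $params['page'] ?? null;
        if (is_string($page) && ctype_digit($page)) {
            $last = max($last, (int) $page);
        }
    }
    return $last;
}

// Made-up landing page with a pager.
$landing = '<html><body><nav>'
    . '<a href="/review/example.com?page=2">2</a>'
    . '<a href="/review/example.com?page=3">3</a>'
    . '<a href="/review/example.com?languages=all&amp;page=12">Last</a>'
    . '</nav></body></html>';

echo detectLastPage($landing), "\n"; // prints 12
```

The returned value could then replace the literal 25 as the upper bound of the download loop.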

<?php
$trustpilot = preg_replace('/\s+/', '', $trustpilot); // Strip all whitespace
$first = explode( '"name":"' , $trustpilot );
$second = explode('"' , $first[1] );
$result = preg_replace('/[^a-zA-Z0-9-.*_]/', '', $second[0]);    // Don't allow special characters
?>
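The snippet above only ever reads the first "name" value. If string matching is kept instead of DOM parsing, a sketch that collects every occurrence with preg_match_all might look like this. The sample input and the helper name extractNames are made up, and regex matching on embedded JSON is fragile (it cannot tell which object a name belongs to), so treat it as an illustration rather than a recommendation.

```php
<?php
declare(strict_types=1);

/**
 * Collects every value of a "name":"..." pair found in $source.
 */
function extractNames(string $source): array
{
    // Match "name" : "value", allowing backslash-escaped characters in the value.
    preg_match_all('/"name"\s*:\s*"((?:[^"\\\\]|\\\\.)*)"/', $source, $matches);
    $names = [];
    foreach ($matches[1] as $raw) {
        // Decode JSON escapes such as \" by parsing the value as a JSON string.
        $decoded = json_decode('"' . $raw . '"');
        if (is_string($decoded)) {
            $names[] = $decoded;
        }
    }
    return $names;
}

// Made-up input containing two names, one with escaped quotes.
$sample = '{"author":{"name":"Alice"}},{"author":{"name":"Bob \"B\" Smith"}}';
print_r(extractNames($sample));
```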

<?php
declare(strict_types = 1);
header("Content-Type: text/plain;charset=utf-8");

// Create a single curl handle and reuse it for every page.
$ch = curl_init();
curl_setopt($ch, CURLOPT_ENCODING, ''); // enables compression
// curl_setopt($ch, CURLOPT_POST, true);
// curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
$reviews = [];
for ($x = 0; $x <= 25; $x ++) {
    // Only the URL changes between requests.
    curl_setopt($ch, CURLOPT_URL, "https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
    $trustpilot = curl_exec($ch);

    // Check if any error occurred
    if (curl_errno($ch)) {
        die('fatal error: curl_exec failed, ' . curl_errno($ch) . ": " . curl_error($ch));
    }
    // loadHTML() is an instance method; calling it statically fails on PHP 8.
    $domd = new DOMDocument();
    @$domd->loadHTML($trustpilot);
    $xp = new DOMXPath($domd);
    foreach ($xp->query("//article[@class='review-card']") as $review) {
        $id = $review->getAttribute("id");
        $reviewer = $xp->query(".//*[@class='content-section__consumer-info']", $review)->item(0)->textContent;
        $stars = $xp->query('.//div[contains(@class,"star-item")]', $review)->length;
        $title = $xp->query('.//*[@class="review-info__body__title"]', $review)->item(0)->textContent;
        $text = $xp->query('.//*[@class="review-info__body__text"]', $review)->item(0)->textContent;
        // Key by review id so a review seen on more than one page is stored once.
        $reviews[$id] = array(
            'reviewer' => mytrim($reviewer),
            'stars' => ($stars),
            'title' => mytrim($title),
            'text' => mytrim($text)
        );
    }
}
curl_close($ch);
echo json_encode($reviews, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | (defined("JSON_UNESCAPED_LINE_TERMINATORS") ? JSON_UNESCAPED_LINE_TERMINATORS : 0) | JSON_NUMERIC_CHECK);


// Trim the text and collapse runs of whitespace into single spaces.
function mytrim(string $text): string
{
    return preg_replace("/\s+/", " ", trim($text));
}

Output of the script above:

{
    "4d6bbf8a0000640002080bc2": {
        "reviewer": "Clement Skau Århus, DK, 3 reviews",
        "stars": 5,
        "title": "Godt fundet på!",
        "text": "Det er rigtig fint gjort at lave et example domain. :)"
    }
}
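Once the reviews are encoded as JSON in the shape shown above, they can be read back with json_decode. The following sketch assumes that shape (review id as key, with a stars field per review) and uses a hypothetical helper, averageStars, to compute the mean rating.

```php
<?php
declare(strict_types=1);

/**
 * Returns the mean star rating of the decoded reviews,
 * or null when no review carries a "stars" value.
 */
function averageStars(array $reviews): ?float
{
    $stars = array_column($reviews, 'stars');
    if ($stars === []) {
        return null;
    }
    return array_sum($stars) / count($stars);
}

// Same data as the output shown above.
$json = '{"4d6bbf8a0000640002080bc2": {"reviewer": "Clement Skau Århus, DK, 3 reviews", '
    . '"stars": 5, "title": "Godt fundet på!", '
    . '"text": "Det er rigtig fint gjort at lave et example domain. :)"}}';

// Passing true as the second argument returns associative arrays instead of objects.
$reviews = json_decode($json, true);

foreach ($reviews as $id => $review) {
    echo $id, ': ', $review['stars'], ' stars, "', $review['title'], "\"\n";
}
var_dump(averageStars($reviews));
```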