PHP Simple HTML DOM Parser: combining two arrays

php html dom web-scraping

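What I'm trying to do is scrape a page on Trip Advisor. I get the content I need from the first page, then run another loop to fetch content from the next page, but when I try to add those details to the existing array, for some reason it doesn't work.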

error_reporting(E_ALL);
include_once('simple_html_dom.php');

$html = file_get_html('http://www.tripadvisor.co.uk/Hotels-g186534-c2-Glasgow_Scotland-Hotels.html');

$articles = '';

// Find all article blocks
foreach($html->find('.listing') as $hotel) {
    $item['name']     = $hotel->find('.property_title', 0)->plaintext;
    $item['link']     = $hotel->find('.property_title', 0)->href;

    $item['rating']    = $hotel->find('.sprite-ratings', 0)->alt;
    $item['rating']    = explode(' ', $item['rating']);
    $item['rating']    = $item['rating'][0];

    $articles[] = $item;
}

foreach($articles as $article) {

    echo '<pre>';
    print_r($article);
    echo '</pre>';

    $hotel_html = file_get_html('http://www.tripadvisor.co.uk'.$article['link'].'/');

    foreach($hotel_html->find('#MAIN') as $hotel_page) {
        $article['address']            = $hotel_page->find('.street-address', 0)->plaintext;
        $article['extendedaddress']    = $hotel_page->find('.extended-address', 0)->plaintext;
        $article['locality']           = $hotel_page->find('.locality', 0)->plaintext;
        $article['country']            = $hotel_page->find('.country-name', 0)->plaintext;

        echo '<pre>';
        print_r($article);
        echo '</pre>';

        $articles[] = $article;
    }
}

echo '<pre>';
print_r($articles);
echo '</pre>';

Here is all the debug output I get:

URL:

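I would change

$articles = '';

to:

$articles = array();

Before the foreach():

$articlesNew = array();

While iterating over the array, insert into the new array:

$articlesNew[] = $article;

Finally, merge the arrays:

$articles = array_merge($articles, $articlesNew);

Source: for more on merging/combining arrays in PHP

I have never tried to change an array in PHP while already iterating over it, but if you mishandle a C++ collection that way, it will crash unless you handle the fatal exception. My guess is that you should not modify an array while iterating over it. I know I would never do that. Use another variable.

You would be better off using SimpleXML or DomDocument. Just saying. I know this may sound a bit lame, since you didn't ask for it. So I'll be quiet now.

The problem with using an XML library for web scraping is that it will not tolerate any markup that is invalid XML, and even if the site claims to be XHTML, it is quite likely to be invalid XML. simple_html_dom parses in a more browser-like "tag soup" way, so it can produce a more robust scraper.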
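
For reference, the steps from the answer can be put together in a minimal sketch. It makes some assumptions: the sample hotels are made up, and `fetchDetails()` is a hypothetical stand-in for the per-hotel `file_get_html()` call and selector lookups, so the example runs without network access. Note also that on PHP 7.1 and later, using `[]` on a variable initialised as a string raises an error, which is one more reason to start from `array()`.

```php
<?php
// Hypothetical stand-in for scraping one hotel page; a real version
// would call file_get_html() and read the address selectors.
function fetchDetails(array $article): array
{
    $article['address'] = 'Address for ' . $article['name'];
    return $article;
}

// Start from an array (not a string) so that [] appends work.
// These two entries are invented sample data.
$articles = array(
    array('name' => 'Hotel A', 'link' => '/hotel-a'),
    array('name' => 'Hotel B', 'link' => '/hotel-b'),
);

// Collect the enriched entries in a separate array while iterating.
$articlesNew = array();
foreach ($articles as $article) {
    $articlesNew[] = fetchDetails($article);
}

// Merge after the loop has finished. With numeric keys, array_merge()
// appends the second array's values after the first array's values.
$articles = array_merge($articles, $articlesNew);

print_r($articles);
```

After the merge, `$articles` holds the original entries followed by the enriched copies. If only the enriched entries are wanted, `$articlesNew` can be used on its own and the merge skipped.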