How do I scrape a web page with PHP?

Tags: php, curl, web-scraping

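I am trying to scrape a web page and parse some data from it. But every time I try, I only get the HTTP response headers back. This is the code I use to fetch the data from the site: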

$host = 'Host: dealnews.com';
$user_agent = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0';
$accept = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$accept_language = 'Accept-Language: en-US,en;q=0.5';
$accept_encoding = 'Accept-Encoding: gzip, deflate';
$connection = 'Connection: keep-alive';
$cookie = 'Cookie=front_page_sort=hotness; dnvta=%7B%22uid%22%3A%22VkA1VlBBb0tNcXdBQVF6UlJrTUFBQUJN%22%2C%22vid%22%3A%22VkA1bGx3b0tNcXdBQVF6bW53QUFBQUEt%22%2C%22fvts%22%3A1475237180%2C%22lvts%22%3A1475241453%2C%22ref%22%3A%22%22%2C%22usid%22%3A0%2C%22ct%22%3A2%2C%22cr%22%3A1475237180%7D; last_visit=1475241457; _ceg.s=oebjle; _ceg.u=oebjle; _ga=GA1.2.185245695.1475237222; __gads=ID=1921ec3c3fe54b1b:T=1475237222:S=ALNI_MZJZEuNpmg3Aq5e007E7iFjwuQ0nw; original_eref=DIRECT; _gat=1; mp_dealnews_mixpanel=%7B%22distinct_id%22%3A%20%221577afe52c549-01b1cfdcc8ca548-13666c4a-100200-1577afe52c620c%22%7D';

$requestHeaders = array ( $host, $user_agent, $accept, $accept_encoding, $accept_language, $connection, $cookie );

$ch = curl_init ( 'http://dealnews.com/2-LED-Window-Candles-w-Color-Changing-Bulbs-for-4-2-s-h/1797165.html?iref=rss-dealnews-todays-edition' );
curl_setopt ( $ch, CURLOPT_TIMEOUT, 100 );
curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 100 );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, false );
curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt ( $ch, CURLOPT_HEADER, TRUE );
curl_setopt ( $ch, CURLOPT_ENCODING, "gzip" );
curl_setopt ( $ch, CURLOPT_HTTPHEADER, $requestHeaders );
$data = curl_exec ( $ch );
if (! $data) {
    die ( "Error: " . curl_error ( $ch ) . " Error no: " . curl_errno ( $ch ) );
}
curl_close ( $ch );
$htmlContent = str_get_html ( $data );
echo $htmlContent;
But it gives me the output below:

HTTP/1.1 302 Found
Date: Fri, 30 Sep 2016 13:50:44 GMT
Server: Apache
X-Powered-By: PHP/5.5.9-1ubuntu4.19
Status: 302 Found
Location: /lw/landing.html?uri=%2F2-LED-Window-Candles-w-Color-Changing-Bulbs-for-4-2-s-h%2F1797165.html%3Firef%3Drss-dealnews-todays-edition
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 20
X-Cnection: close
Content-Type: text/html; charset=utf-8

So, can anyone help me figure out what I am doing wrong here?

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

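The 302 status in the headers is a redirect. With this option set, cURL follows the Location header and returns the page it points to.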
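As a rough sketch of how that option fits into a complete request, the snippet below fetches a page body while following redirects. The helper names `buildFetchOptions` and `fetchPage`, the redirect limit, and the timeout values are assumptions made for illustration; they are not part of the original question or answer. It also turns `CURLOPT_HEADER` off, because with that option on the response headers are prepended to the returned string and would be passed to the HTML parser along with the markup.

```php
<?php
// Hypothetical helper: cURL options for fetching a page body while
// following redirects. The limits below are illustrative values only.
function buildFetchOptions(int $maxRedirects = 5): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,          // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow 3xx Location headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop on redirect loops
        CURLOPT_HEADER         => false,         // keep headers out of the returned string
        CURLOPT_ENCODING       => '',            // accept any encoding cURL supports (gzip, deflate, ...)
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Hypothetical helper: returns the body of the final page after redirects.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, buildFetchOptions());

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException(
            'cURL error ' . curl_errno($ch) . ': ' . curl_error($ch)
        );
    }

    // Status code of the last response in the redirect chain.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException('Unexpected HTTP status ' . $status);
    }

    return $body;
}
```

Calling `fetchPage()` with the URL from the question and passing the result to `str_get_html()` would then hand the parser HTML only. Whether the landing page behind the redirect contains the wanted data still has to be checked against the live site.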

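If you want to screen-scrape with PHP, I have done it successfully with the library [link missing]. It is very simple and easy to use. I know the website looks a bit dated, but my code from last year still runs fine. No cron errors so far.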
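The library this answer refers to is not named in this copy, so the sketch below swaps in PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) to show what the parsing step can look like once the HTML has been fetched. The helper name `extractTitleAndLinks` and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: pulls the page title and all link targets out of
// an HTML string using PHP's bundled DOM extension (not the library the
// answer refers to, which is not named here).
function extractTitleAndLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings internally
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode !== null ? trim($titleNode->textContent) : '';

    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return ['title' => $title, 'links' => $links];
}

// Made-up sample markup, standing in for a fetched page.
$sample = '<html><head><title> Demo deal </title></head><body>'
        . '<a href="/a.html">A</a><a name="x">no href</a><a href="/b.html">B</a>'
        . '</body></html>';

print_r(extractTitleAndLinks($sample));
```

The XPath expressions are the part to adapt: replace `//a[@href]` with a query that matches the elements you actually need on the target page.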

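It's a redirect. Enable the CURLOPT_FOLLOWLOCATION option.

Oh, yes! Thanks @Barmar

Possible duplicate

I always feel it's unethical to answer questions about website scraping or the like... :\

@CD001 I am doing the scraping for the same client whose website I will be scraping. So I only scrape with their permission.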