PHP: Scraping data from a website and getting its HTML as plain text

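Please check the code below. I am trying to scrape a website through a proxy, and that part works now. The problem is that print_r displays the data in an unreadable format. I need it as "normal" HTML source code. How do I do that?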

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.amazon.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, '142.234.203.59:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'haris20202:veryfastplease123');
$data = curl_exec($ch);
curl_close($ch);

print_r($data);

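Include header('Content-Type: application/json'); in your file and use a slightly more complete cURL function so that the response comes back as a string. The response from that function looks fine, but it contains a robot check. The verbose output of the request: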

* Rebuilt URL to: https://www.amazon.com/
*   Trying 142.234.203.59...
* TCP_NODELAY set
* Connected to 142.234.203.59 (142.234.203.59) port 12345 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.amazon.com:443
* Proxy auth using Basic with user 'haris20202'
> CONNECT www.amazon.com:443 HTTP/1.1
Host: www.amazon.com:443
Proxy-Authorization: Basic aGFyaXMyMDIwMjp2ZXJ5ZmFzdHBsZWFzZTEyMw==
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Proxy-Connection: Keep-Alive

< HTTP/1.1 200 Connection established
< 
* Proxy replied 200 to CONNECT request
* CONNECT phase completed!
* ALPN, offering http/1.1
* successfully set certificate verify locations:
  CAfile: c:/wwwroot/cacert.pem
  CApath: none
* CONNECT phase completed!
* CONNECT phase completed!
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=Washington; L=Seattle; O=Amazon.com, Inc.; CN=www.amazon.com
*  start date: Sep 18 00:00:00 2019 GMT
*  expire date: Aug 23 12:00:00 2020 GMT
*  subjectAltName: host "www.amazon.com" matched cert's "www.amazon.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert Global CA G2
*  SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Accept: */*
Accept-Encoding: deflate, gzip

< HTTP/1.1 200 OK
< Content-Type: text/html
< Content-Length: 2097
< Connection: keep-alive
< Server: Server
< Date: Tue, 26 Nov 2019 10:14:10 GMT
< Vary: Content-Type,Cookie,Referer,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
< Content-Encoding: gzip
< x-amz-rid: DTAY61T1CN3HGSADJG16
< Edge-Control: no-store
< X-Cache: Miss from cloudfront
< Via: 1.1 274469ea4a9ada6e05630e17982ca5de.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: PHL50
< X-Amz-Cf-Id: R3hAZb_0qdQYB25p3WwZ5D-wK_1ujzleVSOS7EZo_zsTyMx9oYU6CA==
< 
* Connection #0 to host 142.234.203.59 left intact
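
The response headers in the log include `Content-Encoding: gzip`, so one plausible reason the output of `print_r($data)` looks unreadable is that the body is still compressed; `CURLOPT_HEADER` additionally puts the raw response headers in front of it. The code of the original answer is not reproduced here, so the following is only a minimal sketch of what a "slightly more complete" cURL function might look like. The names `buildFetchOptions` and `fetchHtml` and their parameters are hypothetical, and following redirects is an assumption of this sketch.

```php
<?php
// Hypothetical helper: builds the cURL options for fetching a page as a readable string.
function buildFetchOptions(string $url, ?string $proxy = null, ?string $proxyAuth = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_HEADER         => false, // keep the response headers out of the body
        CURLOPT_ENCODING       => '',    // accept every encoding curl supports and decode it
        CURLOPT_FOLLOWLOCATION => true,  // assumption: redirects should be followed
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;          // "host:port"
        $options[CURLOPT_HTTPPROXYTUNNEL] = true;  // tunnel through the proxy via CONNECT
    }
    if ($proxyAuth !== null) {
        $options[CURLOPT_PROXYUSERPWD] = $proxyAuth; // "user:password"
    }
    return $options;
}

// Hypothetical helper: performs the request and returns the decoded body.
function fetchHtml(string $url, ?string $proxy = null, ?string $proxyAuth = null): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildFetchOptions($url, $proxy, $proxyAuth));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}
```

To view the result as source in a browser, something like `echo '<pre>' . htmlspecialchars(fetchHtml('https://www.amazon.com')) . '</pre>';` should work. Escaping with `htmlspecialchars` is a choice made for this sketch, not something the answer prescribes.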


Amazon has an API. Have you considered using it?

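Is there any solution?

Amazon has implemented a BOT-check system, which may be interfering.

So what is the value of $data?

Here you can see the first point: you need to configure cURL to perform HTTPS requests! The second point is that even with the correct cURL options set, you will not get the information you want because, as mentioned above, they use a bot check, which means you will receive a notice telling you to contact Amazon API support ;)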
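
The two points in the last comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's code: `withHttpsVerification` and `looksLikeBotCheck` are hypothetical names, the CA bundle path is whatever file you have locally (the verbose log shows a `cacert.pem` being used), and the marker strings are guesses that the caller supplies, since the text of the bot-check page is not quoted anywhere in this thread.

```php
<?php
// Hypothetical helper: adds certificate verification settings to a cURL options array.
function withHttpsVerification(array $options, string $caFile): array
{
    $options[CURLOPT_SSL_VERIFYPEER] = true;    // verify the server certificate
    $options[CURLOPT_SSL_VERIFYHOST] = 2;       // check that the certificate matches the host name
    $options[CURLOPT_CAINFO]         = $caFile; // path to a CA bundle such as cacert.pem
    return $options;
}

// Hypothetical helper: reports whether the HTML contains any caller-supplied marker.
// The comparison is case-insensitive; empty markers are ignored.
function looksLikeBotCheck(string $html, array $markers): bool
{
    foreach ($markers as $marker) {
        if ($marker !== '' && stripos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}
```

A caller could pass the result of `withHttpsVerification` to `curl_setopt_array` and then test the returned body with, for example, `looksLikeBotCheck($html, ['robot check', 'captcha'])`. If that returns true, no combination of cURL options is likely to help, which is the commenter's argument for using the official API instead.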