How to scrape Amazon with Python 3


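I am trying to read all the comments for a given product, both to learn Python and for a project. To keep the task simple, I picked a product at random to code against.

The link I want to read is on Amazon, and I open it with urllib: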

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
After reading the link into the amazon variable, printing it gives the following:

print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>
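How do I read the page and pass it to Beautiful Soup?

Edit 1

I did use requests.get, and when I checked the text of the retrieved page, I found the content below, which does not match the page at that link: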

print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">

<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->

<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>

<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b><a href="http://www.amazon.in/ref=cs_503_link/">Go to the Amazon.in home page to continue shopping</a></b>
</font>

</center>
</body>
</html>

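Personally, I would use the requests library for this rather than urllib. Requests has more features.

import requests
from bs4 import BeautifulSoup

From there:

resp = requests.get(url)  # You can also split out your parameters and pass base_url and params if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')

This is a very simple HTTP request, so you should get a response back.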

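Edit: Based on your error, you will have to work out which parameters to pass so that your request looks correct. In general, with requests this means passing the values you discover along with the URL. To find those values, open your browser's debug/developer tools, inspect your network traffic, and see what is sent to Amazon when you use a browser.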
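The suggestion to split a product URL into a base URL and a set of parameters could be sketched roughly as follows. This is not the answerer's code: `split_url` and `fetch` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the headers are whatever you copy from your browser's network tab.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def split_url(url):
    """Split a full URL into (base_url, params) so both can be passed separately."""
    parts = urlsplit(url)
    # Rebuild the URL without its query string or fragment.
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # Turn "a=1&b=2" into {"a": "1", "b": "2"}.
    params = dict(parse_qsl(parts.query))
    return base_url, params


def fetch(url, headers):
    """Hypothetical helper: send the request with explicit params and headers."""
    import requests  # third-party; imported here so split_url works without it

    base_url, params = split_url(url)
    return requests.get(base_url, params=params, headers=headers, timeout=10)
```

Keeping the parameters in a dictionary makes it easier to compare them against what the browser sends and to swap in a different product ID later.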


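With your current library, urllib, this is what you can do: use .read() to get the HTML, then pass it to BeautifulSoup as shown here. Keep in mind that Amazon is a heavily anti-scraping site. You may be getting different results because the HTML is wrapped in JavaScript. For that you may need to use Selenium or dryscrape. You may also need to pass headers/cookies and other attributes into your request.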

import urllib.request
from bs4 import BeautifulSoup

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()  # bytes containing the page HTML
soup = BeautifulSoup(html, 'html.parser')
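Since `.read()` on an `HTTPResponse` returns bytes rather than text, a small helper can decode the body using the charset the server declares. The following is a minimal sketch under that assumption; `read_html` and `fetch_html` are hypothetical names, and falling back to UTF-8 when no charset is declared is a guess that may not suit every page.

```python
import urllib.request


def read_html(response, default_charset="utf-8"):
    """Read an HTTPResponse-like object and return its body as a str."""
    raw = response.read()  # bytes
    # Use the charset from the Content-Type header if the server sent one.
    charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


def fetch_html(url):
    """Open the URL and return the decoded page text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return read_html(response)
```

The returned string can be passed to BeautifulSoup in the same way as the raw bytes.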
Edit ---- You are now using requests. Using requests with my headers passed in like this, I get a 200 response:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
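soup = BeautifulSoup(response.text, 'html.parser')  # pass the page text, not the Response object
print(response.status_code)  # 200 when the request is accepted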
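Because the block page in the question arrives as a 503 with a comment about automated access, it may help to check for it before parsing and to retry after a pause. The following is only a sketch: `looks_blocked` and `fetch_with_backoff` are hypothetical helpers, the marker string is copied from the 503 page in the question, and whether Amazon always serves that exact page is an assumption.

```python
import time

# Text taken from the HTML comment in the 503 page shown in the question.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(status_code, html):
    """Return True if the response looks like the 503 block page."""
    return status_code == 503 or BLOCK_MARKER in html


def fetch_with_backoff(fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns an unblocked page or retries run out.

    fetch must return a (status_code, html) tuple. The wait doubles after
    each blocked attempt. Returns the HTML, or None if every attempt failed.
    """
    for attempt in range(retries):
        status_code, html = fetch()
        if not looks_blocked(status_code, html):
            return html
        if attempt < retries - 1:
            sleep(delay * 2 ** attempt)
    return None
```

With requests, `fetch` could be something like `lambda: (r := requests.get(url, headers=headers)) and (r.status_code, r.text)`, although a small named function is easier to read.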
--- Using dryscrape

import dryscrape
from bs4 import BeautifulSoup

sess = dryscrape.Session(base_url='http://www.amazon.in')
# Set the header before visiting so that it applies to the page request
sess.set_header('user-agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)

## This should give you all of the Amazon HTML now. Keep in mind that I haven't tested this code. Refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html

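You are getting that error; you most likely need to pass additional headers with the request. Look up how to set headers in urllib. You need to act like a person using a browser by passing in a user agent and other attributes.

I did use requests, but was unable to retrieve anything. Perhaps this is because, as you said, Amazon is an anti-scraping site. Were you able to run the code?

Hey. I'll update the thread to use Python requests. This should work! See the edited version.

That is indeed very true. That's why I mentioned Selenium/dryscrape at the beginning of this post.

Oops, my bad. I may have replied to the wrong comment.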
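The comment about setting headers in urllib can be illustrated with `urllib.request.Request`, which accepts a `headers` dictionary. This is a minimal sketch: `build_request` is a hypothetical helper, the User-Agent string is the one used in the answer above, and the `Accept-Language` value is only an illustrative guess at the "other attributes" a browser sends.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Build a urllib Request that carries browser-like headers."""
    headers = {
        "User-Agent": user_agent,
        # Illustrative extra header; copy real values from your browser.
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)
```

The request object is then passed to `urllib.request.urlopen(build_request(url))` in place of the bare URL string.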