Scraping Amazon web pages with Python


python web-scraping phantomjs


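I am trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for books. The problem is that when I fetch the page source through PhantomJS, the prices all come back as 0,00. The code below is a simple test.

I am new to web scraping, and I can't tell whether Amazon has measures in place to prevent scraping of prices or whether I am doing something wrong, because I tried other, simpler pages and could get the data I wanted.

My country is not supported by the Amazon API, which is why I need a scraper.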

import re
import urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'#'http://www.amazon.com/gp/product/1119998956'

class AmzonScraper(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source)
        return s

    def scrape(self):
        source = self.scrape_prices()
        print source
        self.driver.quit()

if __name__ == '__main__':
    scraper = TaleoJobScraper()
    scraper.scrape()

First of all, following @Nick Bailey's comment, study the Terms of Use and make sure there are no violations on your side.

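To solve the problem, you need to adjust the PhantomJS desired capabilities, in particular by setting the user agent to that of a regular browser.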
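A minimal sketch of what such a capability adjustment might look like, assuming Selenium 3.x or earlier (where `webdriver.PhantomJS` is still available). The names `USER_AGENT`, `with_user_agent` and `make_driver` are illustrative and do not come from the original answer, and the user-agent string is only one example of a regular browser's.

```python
# Example browser user agent (an assumption; any current browser string should do).
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) "
              "Gecko/20100101 Firefox/36.0")


def with_user_agent(base_caps, user_agent=USER_AGENT):
    """Return a copy of base_caps with the PhantomJS user agent overridden."""
    caps = dict(base_caps)  # copy, so the shared defaults are not mutated
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


def make_driver():
    """Start PhantomJS with a browser-like user agent (needs Selenium <= 3.x)."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    caps = with_user_agent(DesiredCapabilities.PHANTOMJS)
    driver = webdriver.PhantomJS(desired_capabilities=caps)
    driver.set_window_size(1120, 550)  # same window size as in the question
    return driver
```

In the question's class, `self.driver = make_driver()` would then take the place of the bare `webdriver.PhantomJS()` call in `__init__`.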

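And, to make it bulletproof, you can define a custom wait condition (named wait_for_price in the usage below) that waits for the price to become non-zero.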
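The condition itself might be sketched as follows. This is an assumption about its shape, not the answer's exact code: a callable that `WebDriverWait.until()` polls with the driver, returning true once the located element's text contains a non-zero digit. The `ImportError` fallback is only there so the sketch can be exercised without Selenium installed.

```python
try:
    from selenium.common.exceptions import StaleElementReferenceException
except ImportError:
    # Stand-in so the sketch can be imported and tested without Selenium.
    class StaleElementReferenceException(Exception):
        pass


class wait_for_price(object):
    """Custom wait condition: true once the located price is non-zero."""

    def __init__(self, locator):
        # locator is a (by, value) tuple, e.g. (By.CLASS_NAME, "olpOfferPrice")
        self.locator = locator

    def __call__(self, driver):
        try:
            text = driver.find_element(*self.locator).text.strip()
        except StaleElementReferenceException:
            # The element was re-rendered mid-read; try again on the next poll.
            return False
        # "0,00" and "" have no non-zero digit; "$10.00" or "12,50" do.
        return any(ch in "123456789" for ch in text)
```

Checking for a non-zero digit rather than comparing against the literal "0,00" keeps the condition independent of currency symbol and decimal separator.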

Usage:

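from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def scrape_prices(self):
    self.driver.get(link)

    # Block (for up to 200 seconds) until the offer price is no longer zero
    WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
    s = BeautifulSoup(self.driver.page_source)

    return s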

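Good answer about setting the PhantomJS user agent to that of a normal browser. Since you said your country is blocked by Amazon, I imagine you also need to set a proxy.

Here is an example of how to start PhantomJS in Python with a Firefox user agent and a proxy.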

from selenium.webdriver import PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Command-line switches passed to the PhantomJS process (placeholders for your proxy)
service_args = ['--proxy=1.1.1.1:port', '--proxy-auth=username:pass']
# Copy the default PhantomJS capabilities and present a Firefox user agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0"
driver = PhantomJS(desired_capabilities=dcap, service_args=service_args)

Here 1.1.1.1 is your proxy IP and port is the proxy port. The username and password are only required if the proxy needs authentication.
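Since the authentication switch is optional, the argument list could be built by a small helper such as the one below. This is only a sketch: `proxy_service_args` is a hypothetical name, and the port 8080 in the usage note is a made-up example value.

```python
def proxy_service_args(host, port, username=None, password=None):
    """Build the PhantomJS command-line switches for a proxy."""
    args = ["--proxy={}:{}".format(host, port)]
    # Add credentials only when the proxy actually requires authentication.
    if username is not None and password is not None:
        args.append("--proxy-auth={}:{}".format(username, password))
    return args
```

The result can be passed straight to the driver, for example `PhantomJS(desired_capabilities=dcap, service_args=proxy_service_args("1.1.1.1", 8080))`.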

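Another framework to try is Scrapy. It is simpler than Selenium, which is meant for simulating browser interactions. Scrapy gives you classes for easily parsing data using CSS selectors or XPath, and a pipeline to store that data in whatever format you like, such as writing it to a MongoDB database.

Usually you can write a fully built spider and deploy it to Scrapy Cloud in under 10 lines of code.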

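Check out this YouTube video on how to use Scrapy for a use case like this.

FYI, you should not say that you are doing this; it violates Amazon's ToS and you could get into big trouble.

Where are you scraping anything?

@PadraicCunningham Yes, obviously this has nothing to do with web scraping at all. The class name is AmzonScraper, so it is about the Amzon store, a completely different web store.

@alecxe, TaleoJobScraper() is not found in the code; as far as I can see, the OP's code just downloads the HTML.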