Python Scrapy Request not working

I am using selenium and Scrapy together to collect information. I need to go through every company name; when I reach a company-information page I extract data from it, and I also need to open the Marketing Contacts page and extract data from there. Scrapy's Request opens the company-information page, but when I try to open the Marketing Contacts page, it does not work.

Here is my code:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
import time


class HooverSpider(CrawlSpider):
    name = "hspider"
    allowed_domains = ["hoovers.com"]
    start_urls = ["http://www.hoovers.com/company-information/company-search.html?term=australia&maxitems=25&nvcnt=4&nvsls=[5;10L&nvloc=0&nvemp=[11;49]"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(3)
        company = self.driver.find_elements_by_xpath(
            '//div[@class="cmp-company-directory"]/div[1]/table/tbody/tr/td/a')
        links = []
        for c in company:
            links.append(c.get_attribute('href'))
        for link in links:
            yield Request(str(link), self.parse_link)

    def parse_link(self, response):
        self.driver.get(response.url)
        time.sleep(2)
        if self.driver.find_element_by_xpath('//div[@class="left-content"]/h1').text:
            title = self.driver.find_element_by_xpath('//div[@class="left-content"]/h1').text
        else:
            title = ''
        print title
        if self.driver.find_element_by_xpath('//div[@class="left-content"]/p/span[1]').text:
            street = self.driver.find_element_by_xpath('//div[@class="left-content"]/p/span[1]').text
        else:
            street = ''
        print street
        marketing = self.driver.find_element_by_xpath(
            '//*[@id="fs-comps-A"]/div/div/div/div[1]/div/div[1]/div/ul[2]/li[2]/a').get_attribute('href')
        print marketing
        return Request(marketing, self.parse_page)

    # this one is not working
    def parse_page(self, response):
        print response.url
        self.driver.get(response.url)
        time.sleep(3)
        print 'hello'
But this code does work:
class HooverSpider(CrawlSpider):
    name = "hspider"
    allowed_domains = ["hoovers.com"]
    start_urls = ["http://www.hoovers.com/company-information/cs/marketing-lists.LAFFORT_AUSTRALIA_PTY_LIMITED.3d01c1d98ad9322f.html"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(3)
        marketing = self.driver.find_element_by_xpath(
            '//*[@id="fs-comps-A"]/div/div/div/div[1]/div/div[1]/div/ul[2]/li[2]/a').get_attribute('href')
        return Request(marketing, callback=self.parse_page)

    def parse_page(self, response):
        print 'hh'
The main problem is in how you get the "Marketing Contacts" link. I would locate the link by its "Marketing Contacts" link text. A few additional things to note here:
- You cannot use the same driver instance (self.driver) across different Scrapy methods/callbacks. Scrapy is completely asynchronous, and you would quickly hit a situation where two callbacks collide while trying to reuse the same driver instance and browser window.
- Use explicit waits (WebDriverWait) instead of hard-coded delays with time.sleep().
- Make sure you really need selenium to parse every page along the way, from the search results to the company profile to the marketing contacts.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class HooverSpider(CrawlSpider):
    name = "hspider"

    ...

    def parse_link(self, response):
        driver = webdriver.Firefox()
        driver.get(response.url)
        # Wait explicitly for the link instead of sleeping a fixed time
        marketing = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.LINK_TEXT, "Marketing Contacts")))
        marketing_link = marketing.get_attribute('href')
        driver.close()
        yield Request(marketing_link, self.parse_page)

    def parse_page(self, response):
        print "HERE!"
        print response.url
        print "-----------"
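Note that the locate-by-link-text idea does not strictly require a browser: if the anchor is present in the raw HTML, a plain parser can find the link whose text is "Marketing Contacts". A minimal stdlib-only sketch of the same lookup (the HTML fragment and the `LinkByText` class are hypothetical, for illustration only):

```python
from html.parser import HTMLParser


class LinkByText(HTMLParser):
    """Collect hrefs of <a> elements whose text equals a target string."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self._href = None      # href of the <a> we are currently inside
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only match text that appears inside an open <a> tag
        if self._href is not None and data.strip() == self.target:
            self.matches.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


html = '''
<ul>
  <li><a href="/company-profile">Company Profile</a></li>
  <li><a href="/marketing-contacts">Marketing Contacts</a></li>
</ul>
'''
parser = LinkByText("Marketing Contacts")
parser.feed(html)
print(parser.matches[0])  # -> /marketing-contacts
```

This is essentially what `By.LINK_TEXT` does for you, but without spinning up a browser per page.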
If you really do need selenium, consider using the headless PhantomJS instead of Firefox. At the very least, it would improve performance and let the spider run in environments without a display (replace webdriver.Firefox() with webdriver.PhantomJS()).
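The driver swap is a one-line change; a sketch of the setup (this assumes the phantomjs binary is on your PATH; PhantomJS has since been deprecated, and recent Selenium releases offer headless modes for Firefox and Chrome instead):

```python
from selenium import webdriver

# Headless browser: no X display required, lighter than a full Firefox window.
driver = webdriver.PhantomJS()          # instead of webdriver.Firefox()
driver.get("http://www.hoovers.com/")   # behaves like any other webdriver
driver.quit()
```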