How to scrape multiple URLs from a csv with Selenium and Scrapy
I am currently trying to crawl multiple sites. At the moment I have a "urls.txt" file containing two URLs: 1. 2. The problem I am running into is the following: Selenium opens the two URLs one after the other in the same tab. As a result, it just crawls the content of the second URL in my "urls.txt" file twice, and it scrapes nothing from the first URL. I think there is a problem with the for loop and with how the parse_tip function is called. Here is my code:
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
import re
import csv

class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']

    # We are not using the response parameter in this function because the start urls are not defined.
    # Our Spider class searches for the start_requests function by default.
    # Requests have to be returned or yielded.
    def start_requests(self):
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            yield Request(url, callback=self.parse_tip)

    def parse_tip(self, response):
        sel = Selector(text=self.driver.page_source)
        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
        for post in allposts:
            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
            yield {'Username': username,
                   'Publish date': publish_date}
Why are you making another request with yield Request(url, callback=self.parse_tip) when you already have the response from Selenium? Just pass the page text to parse_tip and use it there:
class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']

    def start_requests(self):
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            for item in self.parse_tip(text=self.driver.page_source):
                yield item

    def parse_tip(self, text):
        sel = Selector(text=text)
        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
        for post in allposts:
            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
            yield {'Username': username,
                   'Publish date': publish_date}
Have you tried running the code? I get an error with `item in yield self.parse_tip(text=self.driver.page_source)`: ^ SyntaxError: invalid syntax.
I did not run the code; you should take care of syntax errors yourself, I was only showing the logic. In any case it should be `for item in self.parse_tip(text=self.driver.page_source):`. I have also updated the code in my answer, see above.
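The delegation pattern discussed in the comments, where start_requests loops over the items produced by parse_tip and yields each one, can also be written with Python's `yield from`. A minimal standalone sketch (the function and field names here are illustrative, not part of the original spider):

```python
def parse_tip(text):
    # Stand-in for the real parser: yield one item per line of "page source".
    for line in text.splitlines():
        yield {'Username': line}

def start_requests(page_source):
    # Equivalent to: for item in parse_tip(page_source): yield item
    yield from parse_tip(page_source)

items = list(start_requests("alice\nbob"))
print(items)  # [{'Username': 'alice'}, {'Username': 'bob'}]
```

`yield from` forwards every value produced by the inner generator, so the outer generator stays a generator and Scrapy can still iterate over the items it yields.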