How to scrape multiple URLs from a csv with Selenium and Scrapy
I am currently trying to crawl multiple sites. At the moment I have a "urls.txt" file containing two URLs: 1. 2. The problem I am running into is the following: Selenium opens the two URLs one after the other in the same tab. As a result, it just crawls the content of the second URL in my "urls.txt" file twice, and it scrapes nothing from the first URL. I think there is a problem with the for loop and with how the parse_tip function is called. Here is my code:
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
import re
import csv

class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']

    # We are not using the response parameter in this function because the start urls are not defined.
    # Our Spider class searches for the start_requests function by default.
    # Requests have to be returned or yielded.
    def start_requests(self):
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            yield Request(url, callback=self.parse_tip)

    def parse_tip(self, response):
        sel = Selector(text=self.driver.page_source)
        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
        for post in allposts:
            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
            yield {'Username': username,
                   'Publish date': publish_date}
Why are you making another request with yield Request(url, callback=self.parse_tip) when you already have the response from Selenium? Just pass the page text to parse_tip and use it there:
class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']

    def start_requests(self):
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            for item in self.parse_tip(text=self.driver.page_source):
                yield item

    def parse_tip(self, text):
        sel = Selector(text=text)
        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
        for post in allposts:
            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
            yield {'Username': username,
                   'Publish date': publish_date}
Have you tried running the code? I get an error with `item in yield self.parse_tip(text=self.driver.page_source)`: ^ SyntaxError: invalid syntax.
I did not run the code; you should take care of syntax errors yourself, I was only showing the logic. In any case it should be `for item in self.parse_tip(text=self.driver.page_source):`. I have also updated the code in my answer, see above.
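The delegation pattern discussed in the comments, where start_requests loops over the items produced by parse_tip and yields each one, can also be written with Python's `yield from`. A minimal standalone sketch (the function and field names here are illustrative, not part of the original spider):

```python
def parse_tip(text):
    # Stand-in for the real parser: yield one item per line of "page source".
    for line in text.splitlines():
        yield {'Username': line}

def start_requests(page_source):
    # Equivalent to: for item in parse_tip(page_source): yield item
    yield from parse_tip(page_source)

items = list(start_requests("alice\nbob"))
print(items)  # [{'Username': 'alice'}, {'Username': 'bob'}]
```

`yield from` forwards every value produced by the inner generator, so the outer generator stays a generator and Scrapy can still iterate over the items it yields.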