
Python Scrapy - Selenium - Requesting the next page


I am trying to build a web crawler that goes to a link and waits for the JavaScript content to load. Then it should collect all the links to the listed articles before moving on to the next page. The problem is that it always scrapes the first URL ("") instead of following the URL I give it. Why does the code below not scrape from the new URL I pass in the request? I am out of ideas.

import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
import time


class TechcrunchSpider(scrapy.Spider):
    name = "techcrunch_spider_performance"
    allowed_domains = ['techcrunch.com']
    start_urls = ['https://techcrunch.com/search/heartbleed']



    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)
        #self.driver = webdriver.Chrome("C:\Users\Daniel\Desktop\Sonstiges\chromedriver.exe")
        self.driver.wait = WebDriverWait(self.driver, 5)    # waits up to 5 seconds

    def parse(self, response):
        start = time.time()     # timing measurement
        self.driver.get(response.url)

        # waits up to 5 seconds (defined above) for the condition to be met; after that it throws the TimeoutException error
        try:    

            self.driver.wait.until(EC.presence_of_element_located(
                (By.CLASS_NAME, "block-content")))
            print("Found : block-content")

        except TimeoutException:
            self.driver.close()
            print(" block-content NOT FOUND IN TECHCRUNCH !!!")


        # Crawl the JavaScript-generated content with Selenium
        ahref = self.driver.find_elements(By.XPATH, '//h2[@class="post-title st-result-title"]/a')

        hreflist = []
        # Collect all the links to the individual articles
        for elem in ahref:
            hreflist.append(elem.get_attribute("href"))


        for elem in hreflist:
            print(elem)
            yield scrapy.Request(url=elem, callback=self.parse_content)


        # Get the link for the next page
        try:
            next_link = self.driver.find_element(By.XPATH, "//a[@class='page-link next']")
            nextpage = next_link.get_attribute("href")
            print("HERE COMES NEXT:")
            print(nextpage)
            #newresponse = response.replace(url=nextpage)
            yield scrapy.Request(url=nextpage, dont_filter=False)

        # find_element raises NoSuchElementException (not TimeoutException) when no "next" link exists
        except NoSuchElementException:
            self.driver.close()
            print(" NEXT NOT FOUND (OR EOF), I'M CLOSING MYSELF !!!")



        end = time.time()
        print("Time elapsed : ")
        finaltime = end-start
        print(finaltime)


    def parse_content(self, response):    
        title = self.driver.find_element(By.XPATH,"//h1")
        titletext = title.get_attribute("innerHTML")
        print(" h1 : ")
        print(title)
        print(titletext)

The first problem is:

for elem in hreflist:
    print(elem)
    yield scrapy.Request(url=elem, callback=self.parse_content)
This code yields a request for every link that was found. But:

def parse_content(self, response):    
    title = self.driver.find_element(By.XPATH,"//h1")
    titletext = title.get_attribute("innerHTML")
The parse_content function tries to parse the page with the driver. You could either parse with the response object that Scrapy passes in, or load the page in the webdriver first (self.driver.get(...)).
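
A minimal sketch of the first option, assuming the article pages do not need JavaScript rendering (the //h1 selector is carried over from the question; extract_first() returns the element's text rather than its innerHTML):

def parse_content(self, response):
    # Parse the response Scrapy already downloaded; no shared driver,
    # so concurrent requests cannot interfere with each other
    titletext = response.xpath("//h1/text()").extract_first()
    print(" h1 : ")
    print(titletext)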


Also, Scrapy is asynchronous and Selenium is not. Scrapy does not block after yielding a request; since it is built on Twisted, it keeps executing code and can fire off several concurrent requests. The single Selenium driver instance cannot keep track of multiple concurrent requests coming from Scrapy. (One lead is to replace every yield with Selenium code, even if that means losing execution time.)
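
As a rough illustration of that lead, the per-article yield could be replaced with sequential calls on the one driver instance (the selectors are assumptions carried over from the question):

for elem in hreflist:
    # Visit each article page in the single driver, one at a time,
    # instead of yielding concurrent Scrapy requests at it
    self.driver.get(elem)
    title = self.driver.find_element(By.XPATH, "//h1")
    print(" h1 : ")
    print(title.get_attribute("innerHTML"))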

I added a self.driver.get(...) for parsing the content and now I can get the h1 title. Still, the next page does not work. How should I replace the yields with Selenium code? Do you have an example? I am not very experienced with Scrapy and Selenium. Thanks.

Try replacing the line 'yield scrapy.Request(url=elem, callback=self.parse_content)' with the content of the parse_content function. For the next_page problem, you can replace the yield with a loop around all the code in the parse function (while there is a next page, do something).
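
Putting both suggestions together, a hedged sketch of what such a loop-driven parse could look like inside the same spider (same selectors as in the question; NoSuchElementException from selenium.common.exceptions is assumed to signal the missing "next" link):

def parse(self, response):
    self.driver.get(response.url)

    while True:
        # Wait for the JavaScript-rendered results on the current page
        self.driver.wait.until(EC.presence_of_element_located(
            (By.CLASS_NAME, "block-content")))

        # Grab the article links and, before leaving the results page,
        # the URL of the next page (None when there is no "next" link)
        links = [a.get_attribute("href") for a in self.driver.find_elements(
            By.XPATH, '//h2[@class="post-title st-result-title"]/a')]
        try:
            nextpage = self.driver.find_element(
                By.XPATH, "//a[@class='page-link next']").get_attribute("href")
        except NoSuchElementException:
            nextpage = None

        # Inlined parse_content: visit every article with the same driver
        for link in links:
            self.driver.get(link)
            title = self.driver.find_element(By.XPATH, "//h1")
            print(title.get_attribute("innerHTML"))

        if nextpage is None:
            break   # no next page: end of the results
        self.driver.get(nextpage)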