Python: Scrapy Splash recursive crawling with CrawlSpider not working
I have integrated scrapy-splash into my CrawlSpider, but it only renders the start URLs. I would like to know how to get scrapy-splash to crawl the internal links as well. I have been searching the internet for a solution, but none of them seem to work. Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy.item import Item, Field
from scrapy_splash import SplashRequest
from urllib.parse import urlencode  # needed by process_links below


class Website(scrapy.Item):
    url = Field()
    response = Field()


class HouzzSpider(CrawlSpider):
    handle_httpstatus_list = [404, 500]
    name = "example"
    # localhost is allowed because process_links rewrites every link to
    # point at the Splash HTTP API
    allowed_domains = ["localhost", "www.example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        Rule(
            LinkExtractor(allow=(), deny=()),
            callback="parse_items",
            process_links="process_links",
            follow=True,
        ),
        Rule(
            LinkExtractor(allow=(), deny=()),
            follow=True,
        ),
    )

    def process_links(self, links):
        # Rewrite each extracted link so it is fetched through the Splash
        # render.html endpoint directly
        for link in links:
            if "http://localhost:8050/render.html?&" not in link.url:
                link.url = ("http://localhost:8050/render.html?&"
                            + urlencode({'url': link.url, 'wait': 2.0}))
        return links

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_items,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse_items(self, response):
        sites = response.selector.xpath('//html')
        items = []
        for site in sites:
            item = Website()
            item['url'] = response.url
            item['response'] = response.status
            items.append(item)
        return items
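Two things in this code tend to cause exactly this symptom. First, start_requests gives the start request an explicit callback (self.parse_items), which bypasses CrawlSpider's default parse entry point, so the rules are never applied to the rendered start page. Second, the requests built by the rules are plain scrapy.Request objects, so even extracted links are fetched without Splash. A commonly suggested workaround (a minimal sketch, not a confirmed fix; SplashCrawlSpider, use_splash, and parse_items are illustrative names) is to yield the start SplashRequest without a callback and to attach the 'splash' meta key, which the scrapy-splash middleware recognizes per its README, in the rule's process_request hook; that routes each request through Splash while keeping the callback CrawlSpider installed, so link following can recurse:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class SplashCrawlSpider(CrawlSpider):
    name = "splash_crawl"  # hypothetical name for illustration
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        # Route every request this rule builds through Splash
        Rule(LinkExtractor(), callback="parse_items",
             process_request="use_splash", follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            # No explicit callback: let CrawlSpider's default parse run
            # so the rules are applied to the rendered start page
            yield SplashRequest(url, endpoint="render.html",
                                args={"wait": 0.5})

    def use_splash(self, request, response):
        # Scrapy >= 2.0 passes (request, response) to process_request;
        # older versions pass only the request. Attaching the 'splash'
        # meta key makes the scrapy-splash middleware render this
        # request without replacing the callback CrawlSpider set.
        request.meta["splash"] = {
            "endpoint": "render.html",
            "args": {"wait": 0.5},
        }
        return request

    def parse_items(self, response):
        yield {"url": response.url, "response": response.status}

This assumes the settings from the scrapy-splash README are already in settings.py, roughly:

SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

One caveat: CrawlSpider._requests_to_follow only extracts links from HtmlResponse objects, and depending on the scrapy-splash version the rendered responses may not pass that isinstance check. If links are still only followed one hop deep, that method may also need to be overridden to accept the Splash response classes (SplashTextResponse, SplashJsonResponse).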
Can you show us what the expected output looks like?
@kaws {'response': 200, 'url': ''} for every link.
Did you ever find an answer to this? I'm in the same situation…
@user299791 No solution; I went with Selenium instead of Splash.