Python: Scrapy Splash recursive crawling with CrawlSpider not working
I have integrated scrapy-splash into my CrawlSpider, but it only renders the start URLs. I would like to know how to get scrapy-splash to crawl the internal links as well. I have been searching the internet for a solution, but none of them seem to work. Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy.item import Item, Field
from scrapy_splash import SplashRequest
from urllib.parse import urlencode  # needed by process_links below


class Website(scrapy.Item):
    url = Field()
    response = Field()


class HouzzSpider(CrawlSpider):
    handle_httpstatus_list = [404, 500]
    name = "example"
    # localhost is allowed because process_links rewrites every link to
    # point at the Splash HTTP API
    allowed_domains = ["localhost", "www.example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        Rule(
            LinkExtractor(allow=(), deny=()),
            callback="parse_items",
            process_links="process_links",
            follow=True,
        ),
        Rule(
            LinkExtractor(allow=(), deny=()),
            follow=True,
        ),
    )

    def process_links(self, links):
        # Rewrite each extracted link so it is fetched through the Splash
        # render.html endpoint directly
        for link in links:
            if "http://localhost:8050/render.html?&" not in link.url:
                link.url = ("http://localhost:8050/render.html?&"
                            + urlencode({'url': link.url, 'wait': 2.0}))
        return links

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_items,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse_items(self, response):
        sites = response.selector.xpath('//html')
        items = []
        for site in sites:
            item = Website()
            item['url'] = response.url
            item['response'] = response.status
            items.append(item)
        return items
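Two things in this code tend to cause exactly this symptom. First, start_requests gives the start request an explicit callback (self.parse_items), which bypasses CrawlSpider's default parse entry point, so the rules are never applied to the rendered start page. Second, the requests built by the rules are plain scrapy.Request objects, so even extracted links are fetched without Splash. A commonly suggested workaround (a minimal sketch, not a confirmed fix; SplashCrawlSpider, use_splash, and parse_items are illustrative names) is to yield the start SplashRequest without a callback and to attach the 'splash' meta key, which the scrapy-splash middleware recognizes per its README, in the rule's process_request hook; that routes each request through Splash while keeping the callback CrawlSpider installed, so link following can recurse:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class SplashCrawlSpider(CrawlSpider):
    name = "splash_crawl"  # hypothetical name for illustration
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        # Route every request this rule builds through Splash
        Rule(LinkExtractor(), callback="parse_items",
             process_request="use_splash", follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            # No explicit callback: let CrawlSpider's default parse run
            # so the rules are applied to the rendered start page
            yield SplashRequest(url, endpoint="render.html",
                                args={"wait": 0.5})

    def use_splash(self, request, response):
        # Scrapy >= 2.0 passes (request, response) to process_request;
        # older versions pass only the request. Attaching the 'splash'
        # meta key makes the scrapy-splash middleware render this
        # request without replacing the callback CrawlSpider set.
        request.meta["splash"] = {
            "endpoint": "render.html",
            "args": {"wait": 0.5},
        }
        return request

    def parse_items(self, response):
        yield {"url": response.url, "response": response.status}

This assumes the settings from the scrapy-splash README are already in settings.py, roughly:

SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

One caveat: CrawlSpider._requests_to_follow only extracts links from HtmlResponse objects, and depending on the scrapy-splash version the rendered responses may not pass that isinstance check. If links are still only followed one hop deep, that method may also need to be overridden to accept the Splash response classes (SplashTextResponse, SplashJsonResponse).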
Can you show us what the expected output looks like?
@kaws {'response': 200, 'url': ''} for every link.
Did you ever find an answer to this? I'm in the same situation…
@user299791 No solution; I went with Selenium instead of Splash.