Python - Scrapy - Creating a spider that takes a list of URLs and crawls them

python scrapy

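I'm trying to create a spider with the Scrapy package that takes a list of URLs and crawls them. I searched Stack Overflow for an answer but couldn't find anything that solves the problem.

My script is as follows: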

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        # URLs are expected as the "urls" keyword argument
        self.start_urls = kwargs.get("urls")
        print(self.start_urls)

    def start_requests(self):
        print(self.start_urls)
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()

When I run the spider:

Spider = Try(urls = [r"https://www.example.com"])
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(Spider)
process.start()

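When printing self.start_urls, I get the following output:

In the __init__ function, the screen shows: [r"https://www.example.com"] (as passed to the spider)
In the start_requests function, the screen shows: None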

Why do I get None? Is there another way to approach this problem, or is there a mistake in my spider class?

Thanks for any help.

If I run

process.crawl(Try, urls=[r"https://www.example.com"])

then it passes urls to Try, as I expected. I don't even need start_requests.

import scrapy

class Try(scrapy.Spider):

   name = "Try"

   def __init__(self, *args, **kwargs):
       super(Try, self).__init__(*args, **kwargs)
       self.start_urls = kwargs.get("urls")

   def parse(self, response):
       print('>>> url:', response.url)
       d = response.xpath( "//body" ).extract()

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r"https://www.example.com"])
process.start()

But if I use

spider = Try(urls=["https://www.example.com"])

process.crawl(spider)

then it looks like it creates a new Try without urls, so start_urls ends up as None.
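Because kwargs.get("urls") returns None when no urls argument reaches the spider, looping over self.start_urls in start_requests fails in that case. A minimal defensive sketch is shown here. The helper name normalize_urls and the comma-separated string handling are assumptions made for illustration, not part of Scrapy or of the original code.

```python
def normalize_urls(urls):
    """Return a list of URL strings for None, one string, or an iterable of strings."""
    if urls is None:
        # No argument supplied: crawl nothing instead of failing on iteration
        return []
    if isinstance(urls, str):
        # Assumption: several URLs in one string are comma-separated
        return [part.strip() for part in urls.split(",") if part.strip()]
    return list(urls)


print(normalize_urls(None))
print(normalize_urls("https://www.example.com, https://www.example.org"))
print(normalize_urls(["https://www.example.com"]))
```

Inside __init__ this could be used as self.start_urls = normalize_urls(kwargs.get("urls")), so a missing argument yields an empty crawl rather than a TypeError.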

I suggest passing the Spider class to process.crawl and supplying the urls argument there.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


class Try(scrapy.Spider):
   name = 'Try'

   def __init__(self, *args, **kwargs):
      super(Try, self).__init__(*args, **kwargs)
      self.start_urls = kwargs.get("urls")

   def start_requests(self):
      for url in self.start_urls:
          yield Request( url , self.parse )

   def parse(self, response):
      d = response.xpath("//body").extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(Try, urls=[r'https://www.example.com'])
process.start()

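Use a different name for the list that holds the URLs instead of self.start_urls. self.start_urls is used by Scrapy, so it may remove them.
I wonder: if you put the URLs in self.start_urls in __init__, maybe Scrapy will use them and you won't need start_requests?
process.crawl creates a new Try object without arguments when it is called through from_crawler. See the Crawler class in the Scrapy source code.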
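
The last comment can be illustrated with a toy model in plain Python. It does not use Scrapy: MiniSpider and mini_crawl are hypothetical stand-ins, and the only thing assumed about Scrapy is what the comment states, namely that the spider is built through a from_crawler classmethod. The point is that a classmethod called on an instance still receives the class, so the pre-built instance and its urls are discarded.

```python
class MiniSpider:
    """Toy stand-in for a spider class (hypothetical, not scrapy.Spider)."""

    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.get("urls")

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # cls is the class itself, even when this is called on an instance
        return cls(*args, **kwargs)


def mini_crawl(spider_or_cls, *args, **kwargs):
    """Hypothetical runner: always builds a fresh spider via from_crawler."""
    return spider_or_cls.from_crawler(None, *args, **kwargs)


prebuilt = MiniSpider(urls=["https://www.example.com"])

# Passing an instance: a new object is created with no arguments
from_instance = mini_crawl(prebuilt)
# Passing the class plus keyword arguments: urls reach __init__
from_class = mini_crawl(MiniSpider, urls=["https://www.example.com"])

print(from_instance is prebuilt)  # False
print(from_instance.start_urls)   # None
print(from_class.start_urls)      # ['https://www.example.com']
```

This matches the behaviour reported in the question: __init__ sees the URLs on the hand-made object, while the object that actually runs has None. Newer Scrapy releases, as far as I know, raise a ValueError when a spider instance is passed to process.crawl, so passing the class is the safe form either way.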