Python: How to make a Scrapy spider faster

python xml web-scraping scrapy

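The XML feed I'm scraping has about 1000 entries. I'd like to know whether there is a way to split the load, or some other way to significantly reduce the run time. Currently it takes two minutes to iterate over all of the XML at the link below. Any suggestions are much appreciated.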

For example (the spider code appears at the end of this post):

The run time for all items is two minutes. Is there any way to make it faster using Scrapy?

Try increasing CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and CONCURRENT_REQUESTS_PER_IP.

Keep in mind, though, that along with higher speed this can lower the success rate: many 429 responses, bans, and so on.
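A minimal sketch of what raising those limits could look like in a project's settings.py. The numbers are illustrative assumptions, not values from the original answer, and they only make a difference when the spider issues many requests.

```python
# settings.py: illustrative values only (assumptions, not tuned recommendations)

# Upper bound on requests Scrapy performs concurrently across all domains.
CONCURRENT_REQUESTS = 64

# Upper bound on concurrent requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 32

# Upper bound per IP address; when non-zero, Scrapy applies this limit
# instead of CONCURRENT_REQUESTS_PER_DOMAIN.
CONCURRENT_REQUESTS_PER_IP = 32
```

The same keys can also be set for a single spider through its custom_settings dictionary instead of the project-wide settings file.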

I tested the following spider locally:

from scrapy.spiders import XMLFeedSpider


class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        yield {'url': node.xpath('.//n:loc/text()').get()}

It runs in under 3 seconds, including Scrapy core startup and everything else.

Make sure the time isn't being spent somewhere else, for example in the learning module from which you import the item subclass.
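One rough way to check whether an import is where the time goes is to time it directly. The helper below is a hypothetical sketch (time_import is not part of Scrapy), and it uses the standard-library module colorsys as a stand-in; inside the project the name to test would be learning.items.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return the seconds spent importing module_name."""
    # Forget a cached top-level entry so the module body runs again.
    # Dependencies that are already imported stay cached, so this is a
    # lower bound compared with a fresh interpreter.
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # "colorsys" is a stand-in; in the project this would be "learning.items".
    print(f"import took {time_import('colorsys'):.4f}s")
```

For a per-module breakdown in a fresh process, the interpreter's -X importtime option (Python 3.7 and later) reports the time spent in each import.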

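Everything is on the same page and the spider just parses through that page, so only one request to the main XML page is needed.

Oops, my mistake, I misread. In that case, yes, these settings are useless, sorry.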
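Because the whole feed arrives in a single response, what remains is parsing about 1000 nodes. One hedged way to see how long that step takes in isolation is to time a plain standard-library parse of a sitemap of similar size. The sketch below uses xml.etree.ElementTree rather than Scrapy's own iterator, and the sitemap with its example.com URLs is a synthetic stand-in, not the real feed.

```python
import time
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(n_entries):
    """Build a synthetic sitemap with n_entries <url> nodes (made-up URLs)."""
    entries = "".join(
        f"<url><loc>https://example.com/products/{i}</loc></url>"
        for i in range(n_entries)
    )
    return f'<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'


def extract_locs(xml_text):
    """Return the text of every <loc> element in the sitemap."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced tags as "{namespace}tag".
    return [loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]


if __name__ == "__main__":
    xml_text = build_sitemap(1000)
    start = time.perf_counter()
    urls = extract_locs(xml_text)
    elapsed = time.perf_counter() - start
    print(f"parsed {len(urls)} entries in {elapsed:.4f}s")
```

If a standalone parse like this finishes quickly while the spider still takes minutes, the delay is more likely in the download, the imports, or the item pipeline than in the XML iteration itself.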

# Spider from the question
from scrapy.spiders import XMLFeedSpider

from learning.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']

    # Prefix 'n' maps to the sitemap namespace used in the XPath below.
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        # Called once per <url> node; extract() returns a list of matches.
        item = TestItem()
        item['url'] = node.xpath('.//n:loc/text()').extract()
        return item