How do I extract URLs from XML using Scrapy's XMLFeedSpider?


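I recently started using Scrapy, and I am trying to use XMLFeedSpider to extract and load the pages listed in an XML page. The problem is that it returns an error: "IndexError: list index out of range".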

I am trying to collect and load all of the product pages listed at the following address:
“”

My spider:

from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['http://www.example.com']

    start_urls = [      
        'http://www.example.com/feed.xml'
    ]   

    itertag = 'loc'

    def parse_node(self, response, node): 
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag,''.join(node.extract()))


This is how the XML input starts:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.htm</loc></url>
<url><loc>http://www.example.htm</loc></url>
(...)

Please share your stack trace from when you get the IndexError: list index out of range.

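Sure, I have added my traceback.

Thanks a lot, Paul! Apparently this is still a bug, because the following is the only way I could get it to work, and it differs from the documentation: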

from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']

    start_urls = [      
        'http://www.example.com/example.xml'
    ]   

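    # Map a prefix of our choosing ('n') to the sitemap's default namespace.
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    # The tag to iterate over must carry that prefix.
    itertag = 'n:loc'
    # Use the Selector-based iterator instead of the default 'iternodes'.
    iterator = 'xml'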

    def parse_node(self, response, node): 
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag,''.join(node.extract()))
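
To illustrate why the namespace prefix matters, here is a minimal sketch using only Python's standard library rather than Scrapy itself. It does not reproduce what XMLFeedSpider does internally; it only shows that, in a document with a default namespace like the sitemap above, a lookup for plain loc finds nothing, while a lookup with a registered prefix finds the URLs. The sample feed, its two URLs, and the helper name find_locs are made up for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sample feed, modelled on the XML shown in the question.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/a.htm</loc></url>
<url><loc>http://www.example.com/b.htm</loc></url>
</urlset>"""


def find_locs(xml_text, path, namespaces=None):
    """Return the text of every element matching `path` (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [element.text for element in root.findall(path, namespaces)]


if __name__ == "__main__":
    # Without a prefix, 'loc' does not match: the elements are really
    # named '{http://www.sitemaps.org/schemas/sitemap/0.9}loc'.
    print(find_locs(SAMPLE_FEED, ".//loc"))

    # With the prefix 'n' mapped to the sitemap namespace, both URLs match.
    print(find_locs(SAMPLE_FEED, ".//n:loc", {"n": SITEMAP_NS}))
```

The prefix itself is arbitrary; what has to match is the namespace URI, which is why the spider above declares it in namespaces and then refers to the tag as n:loc.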