Python: Extracting links from XML with Scrapy

python scrapy

I have an XML page with the following structure:

<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
  ...

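But I don't know how to follow the "link" tags. I have tried LinkExtractor with the tags='links' option, but to no avail. According to the log, the spider reaches the page and gets a 200 response, but it does not pick up any links.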

You should use the xml.etree library:

import xml.etree.ElementTree as ET

# Sample feed item from the question
xml_text = '''
<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
</item>
'''

root = ET.fromstring(xml_text)
# The URL is the text of each <link> element, not an attribute
for link in root.findall('.//link'):
    print(link.text.strip())
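A real feed usually carries several items, so the same idea can be wrapped in a small function. The following is only a sketch: the channel wrapper, the second item, and the name extract_links are assumptions made for illustration and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; the <channel> wrapper and the second
# item are assumptions, only the first item mirrors the question.
FEED = """
<channel>
  <item>
    <title> some text </title>
    <link>
       http://www.example.com/index.xml
    </link>
  </item>
  <item>
    <title> other text </title>
    <link> http://www.example.com/other.xml </link>
  </item>
</channel>
"""


def extract_links(xml_text):
    """Return (title, link) pairs with surrounding whitespace removed."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns None when the child is missing, hence the "or"
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip items without a usable link
            pairs.append((title, link))
    return pairs


links = extract_links(FEED)
for title, link in links:
    print(title, link)
```

Stripping matters here because the element text includes the newlines and indentation around the URL.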

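The key problem here is that this is not regular HTML input but an XML feed, and the links are in the element text, not in attributes. I think you just need an XMLFeedSpider here; see the spider code at the end of this page.

Use restrict_xpaths='//link' in LinkExtractor to get the links from the link tag.

@Vaulstein: Thanks, but no luck. If I run response.xpath("//item/link/text()").extract() in the Scrapy shell, it does return the link text, but when the same thing runs in the main code, the links are definitely not followed.

The key problem here is that the links are in element text, not in attributes. Link extractors by default extract links from href attributes. I think they were designed to get links from attributes, but I'm pretty sure you can point them at the text instead.

@alecxe: I think so too. I tried using the tags parameter of LinkExtractor, but I couldn't get the links that way either.

@DervinThunk: If nobody has provided a working answer by then, I will definitely take another look later.

Thanks. I did get the links, I just couldn't follow them. Also, I see you are using lxml, and I would like to stay within the Scrapy library in order to learn it.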
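The attribute versus element text distinction raised in the comments can be illustrated with the standard library alone. This is a rough sketch, not Scrapy's actual implementation: HrefCollector is a hypothetical stand-in for an extractor that only looks at href attributes, and the feed item is the one from the question.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HrefCollector(HTMLParser):
    """Hypothetical stand-in for an extractor that only reads href attributes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


ITEM = """
<item>
  <title> some text </title>
  <link>
     http://www.example.com/index.xml
  </link>
</item>
"""

# Attribute-based view: the feed has no href attributes at all
collector = HrefCollector()
collector.feed(ITEM)
attribute_links = collector.hrefs

# Text-based view: the URL is the text content of <link>
text_links = [
    el.text.strip() for el in ET.fromstring(ITEM).iter("link") if el.text
]

print(attribute_links)
print(text_links)
```

The attribute-based pass finds nothing in this feed, while reading the element text yields the URL, which is consistent with the behaviour described in the question.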

The xml.etree snippet above prints:

http://www.example.com/index.xml

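import scrapy
from scrapy.spiders import XMLFeedSpider


class MySpider(XMLFeedSpider):
    name = 'myspider'
    start_urls = ['url_here']  # placeholder: replace with the feed URL

    # parse_node is called once for each <item> node in the feed
    itertag = "item"

    def parse_node(self, response, node):
        # The URL is the element text, so select text() and strip whitespace
        for link in node.xpath(".//link/text()").extract():
            yield scrapy.Request(link.strip(), callback=self.parse_link)

    def parse_link(self, response):
        # Called for every link that was followed
        print(response.url)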
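The spider above only strips whitespace before building each request. If a feed might also contain empty link elements, relative paths, or non-HTTP schemes, a small helper could clean the text first. This is a sketch under those assumptions: clean_link and the base URL in the usage lines are hypothetical names, and inside a spider response.urljoin could play the role of urljoin.

```python
from urllib.parse import urljoin, urlparse


def clean_link(base_url, raw_text):
    """Return an absolute http(s) URL built from raw element text, or None."""
    candidate = (raw_text or "").strip()
    if not candidate:
        return None  # empty or whitespace-only <link>
    absolute = urljoin(base_url, candidate)  # resolves relative paths
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto: links are not crawlable
    return absolute


# Hypothetical feed location used only for this illustration
BASE = "http://www.example.com/feeds/feed.xml"

print(clean_link(BASE, "\n     http://www.example.com/index.xml\n  "))
print(clean_link(BASE, "index.xml"))
print(clean_link(BASE, "   "))
```

In parse_node, the result would be checked for None before yielding a scrapy.Request, so that malformed entries are skipped rather than raising an error.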