Python lxml: trouble parsing a Stack Exchange RSS feed


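Hi,

I'm having trouble parsing a Stack Exchange RSS feed in Python. When I try to get the summary nodes, an empty list is returned.

I've been trying to solve this, but I can't get my head around it.

Can anyone help? Thanks, a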


In [30]: import lxml.etree, urllib2

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'

In [32]: cooking_content = urllib2.urlopen(url_cooking)

In [33]: cooking_parsed = lxml.etree.parse(cooking_content)

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')

In [35]: cooking_texts
Out[35]: []

Try using BeautifulStoneSoup from the BeautifulSoup import.

It might do the trick.

Take a look at these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

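As you discovered, the lxml.etree version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects namespaces; the html version works because it ignores them. The lxml.html documentation says: "The HTML parser notably ignores namespaces and some other XMLisms."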
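The namespace behaviour described above can be reproduced offline. The following is a minimal sketch, assuming a tiny hand-written Atom document (ATOM_SAMPLE is made up for illustration and is not the real feed) and an arbitrary prefix 'a':

```python
from lxml import etree

# Hypothetical stand-in for the real feed: a tiny Atom document.
ATOM_SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First</summary></entry>
  <entry><summary>Second</summary></entry>
</feed>"""

root = etree.fromstring(ATOM_SAMPLE)

# Unprefixed names in XPath 1.0 only match elements that are in no namespace,
# so this finds nothing in a document with a default namespace.
plain_matches = root.xpath('//entry/summary')

# Binding a prefix to the Atom namespace makes the same path match.
ns_matches = root.xpath('//a:entry/a:summary',
                        namespaces={'a': 'http://www.w3.org/2005/Atom'})

print('without namespace:', len(plain_matches))
print('with namespace:', len(ns_matches))
```

The prefix used in the XPath expression does not have to match anything in the document; only the namespace URI it is bound to matters.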

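Note: when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x...>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. Here is a corrected version of the etree code: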

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
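If a prefix map feels heavy, two other ways of addressing namespaced elements may be worth knowing. This is only a sketch against a made-up sample document (ATOM_SAMPLE is hypothetical), not a replacement for the corrected code above: findall accepts the {uri}tag form directly, and an XPath local-name() test ignores the namespace altogether.

```python
from lxml import etree

ATOM_NS = 'http://www.w3.org/2005/Atom'

# Hypothetical stand-in for the real feed.
ATOM_SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First</summary></entry>
  <entry><summary>Second</summary></entry>
</feed>"""

root = etree.fromstring(ATOM_SAMPLE)

# Option 1: spell out the namespace in each step as {uri}tag.
# The path is relative to the root <feed> element.
path = '{%s}entry/{%s}summary' % (ATOM_NS, ATOM_NS)
summaries = root.findall(path)
texts = [node.text for node in summaries]

# Option 2: match on the local name only, whatever the namespace is.
# This is looser and could also match summary elements from other vocabularies.
loose = root.xpath('//*[local-name()="summary"]')

print(texts)
print(len(loose))
```

The local-name() form is convenient for quick exploration, while the explicit namespace forms are safer when a document mixes several vocabularies.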

The problem is namespaces.

Run this:

 cooking_parsed.getroot().tag

and you'll see that the element is namespaced as

{http://www.w3.org/2005/Atom}feed

The same is true if you navigate to one of the feed entries.

This means the right XPath in lxml is:

print(cooking_parsed.xpath(
  "//a:feed/a:entry",
  namespaces={'a': 'http://www.w3.org/2005/Atom'}))
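The root-tag check can also be folded into a small helper so the namespace is read from the document rather than hard-coded. This is a rough sketch under assumptions: summary_texts and ATOM_SAMPLE are hypothetical names introduced here, the prefix 'n' is arbitrary, and the helper assumes every element of interest shares the root's namespace.

```python
from lxml import etree

# Hypothetical stand-in for the real feed.
ATOM_SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First</summary></entry>
  <entry><summary>Second</summary></entry>
</feed>"""


def summary_texts(root):
    """Return the text of every entry/summary, using the root's own namespace."""
    # QName splits a '{uri}local' tag into its namespace and local name.
    ns = etree.QName(root).namespace
    if ns is None:
        # No namespace on the root: plain names are enough.
        nodes = root.xpath('//entry/summary')
    else:
        nodes = root.xpath('//n:entry/n:summary', namespaces={'n': ns})
    return [node.text for node in nodes]


root = etree.fromstring(ATOM_SAMPLE)
root_name = etree.QName(root)
print(root_name.namespace, root_name.localname)
print(summary_texts(root))
```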

data.xpath('//ns:feed/ns:entry/ns:summary', namespaces={'ns': 'http://www.w3.org/2005/Atom'})

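Gah, no wonder! It looks like the API renamed the namespaces keyword at some point. Updating my example with working code.

Thank you very much. From now on I'll check the root before I start parsing.

It took me about 3 hours to figure this out! Thanks a lot!

Somehow I suspect this answer was easier for you than it was for me. Sheepishly bumping your answer; feel free to point out any mistakes I made in mine.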