Python Universal Feed Parser issue
I am writing a Python script to parse RSS links. I am using the Universal Feed Parser, and I am running into problems with some links, for example when trying to parse one of them. Here is the sample code:

    feed = feedparser.parse(url)
    items = feed["items"]

Basically, feed["items"] should return all the entries in the feed, i.e. the fields that start with item, but it always comes back empty.

I can also confirm that the following links are parsed as expected:

Is this a problem with the feed, given that the feed from FreeBSD complies with neither standard?

Edit: I am using Python 2.
    def rss_get_items_feedparser(self, webData):
        feed = feedparser.parse(webData)
        items = feed["items"]
        return items

    def rss_get_items_beautifulSoup(self, webData):
        soup = BeautifulSoup(webData)
        for item_node in soup.find_all('item'):
            item = {}
            for subitem_node in item_node.findChildren():
                if subitem_node.name is not None:
                    item[str(subitem_node.name)] = str(subitem_node.contents[0])
            yield item

    def rss_get_items(self, webData):
        items = self.rss_get_items_feedparser(webData)
        if len(items) > 0:
            return items
        return self.rss_get_items_beautifulSoup(webData)

    def parse(self, url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        webData = response.read()
        for item in self.rss_get_items(webData):
            # parse items
I also tried passing the response directly to rss_get_items without reading it first, but BeautifulSoup throws an exception when it tries to read it:
File "bs4/__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
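For context, that TypeError means the read attribute BeautifulSoup tried to call was None rather than a bound method (the traceback shows bs4 doing `markup = markup.read()` on anything that looks file-like). A minimal sketch reproducing the same failure; the FakeResponse class is hypothetical, purely for illustration:

```python
# Minimal illustration of the traceback above: bs4 calls markup.read()
# when the markup object looks file-like. If that attribute is None
# instead of a method, calling it raises exactly this TypeError.

class FakeResponse(object):
    """Hypothetical stand-in for a response whose read attribute is gone."""
    read = None

markup = FakeResponse()
try:
    markup = markup.read()  # same operation as bs4/__init__.py line 161
except TypeError as exc:
    print(exc)  # 'NoneType' object is not callable
```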
I found that the problem is the use of namespaces. In FreeBSD's RSS feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/1999/xhtml"
version="2.0">
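The effect of that default namespace can be reproduced with the standard library alone: every unprefixed element, including `<item>`, is silently placed into the XHTML namespace, so a lookup for a plain `item` tag finds nothing. A sketch (the sample XML below is a trimmed stand-in for the FreeBSD feed, not the real document):

```python
# A default xmlns moves every unprefixed element into that namespace,
# which is why tools searching for a bare "item" tag come back empty.
import xml.etree.ElementTree as ET

rss = """<rss xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns="http://www.w3.org/1999/xhtml" version="2.0">
  <channel><item><title>FreeBSD-SA-14:04.bind</title></item></channel>
</rss>"""

root = ET.fromstring(rss)
print(root.findall('.//item'))             # [] -- no match without the namespace
ns = {'x': 'http://www.w3.org/1999/xhtml'}
print(len(root.findall('.//x:item', ns)))  # 1 -- matches once qualified
```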
Notes:

- I omitted error checking for brevity.
- I recommend using the feedparser API only when BeautifulSoup fails, since BeautifulSoup is the right tool for this job. Hopefully they will update feedparser in the future to make it more lenient.

Here is a complete example:
    from bs4 import BeautifulSoup
    import urllib2

    def rss_get_items(url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        soup = BeautifulSoup(response)
        for item_node in soup.find_all('item'):
            item = {}
            for subitem_node in item_node.findChildren():
                key = subitem_node.name
                value = subitem_node.text
                item[key] = value
            yield item

    if __name__ == '__main__':
        url = 'http://www.freebsd.org/security/rss.xml'
        for item in rss_get_items(url):
            print item['title']
            print item['pubdate']
            print item['link']
            print item['guid']
            print '---'

Output:
FreeBSD-SA-14:04.bind
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
---
FreeBSD-SA-14:03.openssl
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
---
...
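The same extraction can also be sketched with only the standard library, matching elements by their local name so that a default namespace (like the XHTML one on the FreeBSD feed) cannot hide the items. The sample feed string below is an illustrative stand-in, not the real feed, and the code uses Python 3 syntax:

```python
# Namespace-agnostic <item> extraction using only the standard library:
# compare each element's local name (the part after '}') so a default
# namespace cannot hide items from the search.
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit('}', 1)[-1]

def rss_get_items(xml_text):
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if local_name(node.tag) == 'item':
            yield {local_name(child.tag): (child.text or '')
                   for child in node}

sample = """<rss xmlns="http://www.w3.org/1999/xhtml" version="2.0">
  <channel>
    <item>
      <title>FreeBSD-SA-14:04.bind</title>
      <link>http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc</link>
    </item>
  </channel>
</rss>"""

for item in rss_get_items(sample):
    print(item['title'], item['link'])
```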