Python: Why can't BeautifulSoup correctly read/parse this RSS (XML) document?

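YCombinator is nice enough to provide an RSS feed containing the top items on Hacker News. I am trying to write a Python script that accesses the RSS feed document and then parses out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each item.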

Here are a few sample lines from the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

However, the output this script gives looks like the following:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is somehow an empty string. Why is that?

As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:
>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

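You'll notice that something odd is happening with just the link tag. It gets only the closing tag, followed by the text that belongs to that tag. This is very strange behavior, especially since title and comments are parsed without a problem.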
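This seems to be a problem with BeautifulSoup, because what requests actually reads in has nothing wrong with it. I don't think it is limited to BeautifulSoup, though, because I also tried the xml.etree.ElementTree API and the same problem arose (is BeautifulSoup built on this API?).
Does anyone know why this happens, or how I can still use BeautifulSoup without running into this error?
Note: I was finally able to get what I wanted with xml.dom.minidom, but that doesn't seem to be a highly recommended library. I would like to keep using BeautifulSoup if possible.
Update: I am on a Mac running OS X 10.8, with Python 2.7.2 and BS4 4.1.3.
Update 2: I have lxml, installed with pip. It is version 3.0.2. As for libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. I'm not sure when or how that was installed.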
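For the xml.dom.minidom workaround mentioned in the note above, a minimal sketch might look like the following. This is not the asker's actual code (which is not shown here): `FEED` is a shortened copy of the sample feed, and the helper names `child_text` and `summarize` are made up for illustration.

```python
from xml.dom import minidom

# Shortened copy of the sample feed from the question (one item only).
FEED = """<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel>
</rss>"""


def child_text(node, tag):
    """Return the text of the first <tag> element under node (hypothetical helper)."""
    element = node.getElementsByTagName(tag)[0]
    # Join the text and CDATA nodes that sit directly under the element.
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


def summarize(feed_text):
    """Build one 'title - link - comments' line per <item> (hypothetical helper)."""
    dom = minidom.parseString(feed_text)
    lines = []
    for item in dom.getElementsByTagName("item"):
        parts = [child_text(item, tag) for tag in ("title", "link", "comments")]
        lines.append(" - ".join(parts))
    return lines


for line in summarize(FEED):
    print(line)
```

Because minidom is a strict XML parser, it has no notion of HTML's special-cased tags, so link is read the same way as title and comments.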
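Wow, great question. This strikes me as a bug in BeautifulSoup. The reason you can't access the link using soup.find_all('item').link is that when you first load the html into BeautifulSoup, it does something odd to the HTML: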
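>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links
for the intellectually curious, ranked by readers.</description>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'No
tch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-d
ollar-boost-mark-cuban-and-notch
<comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_
050112_8bit_FLAT.html
<comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
...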

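Look carefully: it has actually changed the first <link> tag to <link/> and then removed the </link> tag. I'm not sure why it does this, but unless the problem in the BeautifulSoup.BeautifulSoup class initialization is fixed, you won't be able to use it for now.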
Update:
I think your best (albeit hacky) bet for now is to use the following for link:
>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'

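I don't think there is a bug in BeautifulSoup here.
I installed a clean copy of BS4 4.1.3 on Apple's stock 2.7.2 from OS X 10.8.2, and everything worked as expected. It doesn't mis-parse the <link> as </link>, and so it has no problem with item.find('link').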

I also tried the stock xml.etree.ElementTree and xml.etree.cElementTree in 2.7.2, and xml.etree.ElementTree in python.org 3.3.0, to parse the same thing, and again it worked fine. Here is the code:
import xml.etree.ElementTree as ET

# x is assumed to hold the RSS document text shown in the question
rss = ET.fromstring(x)
for channel in rss.findall('channel'):
  for item in channel.findall('item'):
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print(title)
    print(link)
    print(comments)

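I then installed lxml 3.0.2 (I believe BS uses lxml if it is available), using Apple's built-in /usr/lib/libxml2.2.dylib (which, according to xml2-config --version, is 2.7.8), and ran the same tests with its etree and with BS. Again, everything worked.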
Besides screwing up the <link>, jdotjdot's output also shows BS4 screwing up the <description> in an odd way. The original is this:
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>



His output is:
<description>Comments]]&gt;</description>

Running his exact same code, my output is:
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
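As a point of comparison for the two outputs above, here is a small sketch of what a conforming XML parser is expected to do with that element. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and the variable names are illustrative only. The CDATA section should come back as the element's text, markup included, with no stray `]]>` left over.

```python
import xml.etree.ElementTree as ET

# The <description> element as it appears in the feed.
DESCRIPTION = (
    '<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">'
    'Comments</a>]]></description>'
)

element = ET.fromstring(DESCRIPTION)

# An XML parser treats everything inside CDATA as literal character data,
# so the <a> markup is part of the text and not a child element.
print(element.text)
print(len(element))  # number of child elements
```

If a parser instead returns something like `Comments]]>`, it has most likely treated the input as HTML, where CDATA sections are not handled the same way.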



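So it seems there is a bigger problem here. The odd thing is that it happens to two different people, yet not on a clean install of the latest version of everything.
That means either this is a bug that has already been fixed and I simply have a newer version of whatever had the bug, or there is something odd about the way they both installed things.
BS4 itself can be ruled out, since at least Treebranch has 4.1.3, the same as me. Without knowing how he installed it, though, it could still be a problem with the installation.
Python and its built-in etree can be ruled out, since at least Treebranch has the same stock Apple 2.7.2 from OS X 10.8 that I do.
It could very well be a bug in lxml or the underlying libxml, or in the way they were installed. I know jdotjdot has lxml 2.3.6, so this could be a bug that was fixed somewhere between 2.3.6 and 3.0.2. In fact, judging by the releases and change notes after 2.3.5, there is no 2.3.6, so whatever he has may be some buggy release from very early on a canceled branch or something similar. I don't know.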