BeautifulSoup BeautifulStoneSoup: how to unescape and add closing tags


I'm editing the original post here to clarify, and hopefully I've boiled it down to something more manageable. I have an XML string that looks like:

<foo id="foo">
    <row>
        &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
        &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>

I parse it like this:

xml = BeautifulStoneSoup(someXml, selfClosingTags=['img'], convertEntities=BeautifulSoup.HTML_ENTITIES)

and the result is:

<foo id="foo">
    <row>
        <img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764">
    </row>
    <row>
        <img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225">
    </row>
</foo>

But searching the result for img tags produces an empty list. Any idea why BeautifulStoneSoup can't find my images in this XML?

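The img tags aren't found because BeautifulSoup treats them as part of the text of the "row" tags. Converting entities only changes the strings; it does not change the underlying structure of the document. The following isn't a great solution (it parses the document twice), but it worked when I tested it on your example XML. The idea is to convert the text into bad XML, then let Beautiful Soup clean it up again: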

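from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
# text is the XML string from the question; the inner parse unescapes the entities, the outer parse re-reads the markup
soup = BeautifulSoup(BeautifulSoup(text, convertEntities=BeautifulSoup.HTML_ENTITIES).prettify())
print soup.findAll('img')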
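
The snippet above is Python 2 with BeautifulSoup 3. For anyone on Python 3, the same two-pass idea can be sketched with only the standard library. Treat this as a rough illustration, not the answer's actual method: `html.parser` and `html.unescape` are swapped in for BeautifulSoup, and `ImgCollector` and `find_images` are hypothetical helper names made up for this example.

```python
from html import unescape
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def find_images(markup):
    """Hypothetical helper: parse markup and return a list of img attribute dicts."""
    parser = ImgCollector()
    parser.feed(markup)
    parser.close()
    return parser.images


# The sample XML from the question, with the img tags entity-escaped.
escaped = """<foo id="foo">
    <row>
        &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
        &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>"""

# Pass 1: the escaped tags are just text inside <row>, so no img elements are seen.
before = find_images(escaped)

# Pass 2: unescape the entities first, then parse the resulting markup again.
after = find_images(unescape(escaped))

print(len(before), [img["alt"] for img in after])
```

The first call finds nothing for the reason given above: the parser only ever sees text inside each row. Unescaping before the second parse is what turns that text into real tags.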