Python: Removing tags from HTML parsed with BeautifulSoup

I'm new to Python and I'm using BeautifulSoup to parse a website and then extract data. I have the following code:

for line in raw_data:  # raw_data is the parsed HTML separated into smaller blocks
    d = {}
    # Find the link inside the div with class "torrentname"
    d['name'] = line.find('div', {'class': 'torrentname'}).find('a')
    print(d['name'])

<a href="/ubuntu-9-10-desktop-i386-t3144211.html">
<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>

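That prints the anchor element above, but because of the <strong> HTML tag, .string does not return anything for it. Is there a way to extract the <strong> tag and then use .string, or is there a better way? I tried BeautifulSoup's extract() function but could not get it to work.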

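Edit: I just realized that my solution does not work if there are two sets of <strong> tags, because the space between the words is dropped. How can I fix this?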

Use the .text attribute:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text
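To illustrate why .string comes back empty while .text works, here is a minimal sketch. It assumes the modern bs4 package and the built-in "html.parser" backend, neither of which is named in the original post, and it uses the question's anchor markup condensed onto one line.

```python
from bs4 import BeautifulSoup

# The anchor from the question, condensed onto one line (assumption: bs4 is installed).
html = ('<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
        '<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>')

anchor = BeautifulSoup(html, "html.parser").find("a")

# .string is only defined when a tag has a single child; this anchor has two
# (the <strong> tag and the trailing text), so it is None.
string_value = anchor.string

# .text concatenates every text node under the tag, ignoring the markup.
text_value = anchor.text

print(string_value)  # None
print(text_value)    # Ubuntu 9.10 desktop (i386)
```

The nested <strong> tag still has exactly one child, so anchor.find("strong").string would give "Ubuntu" on its own.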

Or do a join on findAll(text=True):

anchor = line.find('div', {'class':'torrentname'}).find('a')
d['name'] = ''.join(anchor.findAll(text=True))

This doesn't work. It does not preserve spaces in an example like this: UbuntuLinux. It comes out as UbuntuLinux.

Thank you very much, that works perfectly! Can you explain how the second line of code works?

The BeautifulSoup documentation says the text argument lets you "search for NavigableString objects instead of Tags". findAll returns a Python list, which can then be joined together (.join) into a single string.
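The join approach, together with the spacing concern raised in the question's edit, can be sketched as follows. This assumes the modern bs4 package, where find_all(string=True) is the current spelling of findAll(text=True). The sample markup is hypothetical and deliberately has no whitespace between the two <strong> tags. Using get_text with a separator is one possible fix, not something the original answer proposed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two adjacent <strong> tags with no space between them.
html = '<a href="#"><strong>Ubuntu</strong><strong>Linux</strong> 9.10</a>'
anchor = BeautifulSoup(html, "html.parser").find("a")

# find_all(string=True) returns the text nodes under the tag as a list.
pieces = [str(p) for p in anchor.find_all(string=True)]

# Joining with an empty string glues adjacent nodes together.
joined = ''.join(pieces)

# get_text with a separator and strip=True trims each node and
# puts exactly one separator between them.
spaced = anchor.get_text(" ", strip=True)

print(pieces)  # ['Ubuntu', 'Linux', ' 9.10']
print(joined)  # UbuntuLinux 9.10
print(spaced)  # Ubuntu Linux 9.10
```

If the source markup already has whitespace between the tags, the plain join keeps it, so the separator variant mainly matters when tags sit directly next to each other.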