Python: retrieving information from HTML with BeautifulSoup - why does the same text appear multiple times?
Tags: python, html, python-3.x, beautifulsoup
I have an HTML document in the following format:
<html><body><h2>Lorem ipsum <span name="datetime" class="0">dolor <strong>
sit</strong></span> amet, consectetur adipiscing elit.</h2>
<p>Morbi sit amet malesuada nisl. <span name="address" class="1">Phasellus <strong>rhoncus diam</strong> sit amet augue dictum</span>,
porta interdum odio tempus.</p></body></html>
My code:
from bs4 import BeautifulSoup

input_file = BeautifulSoup(open("ex2.html", 'r'), 'lxml')
tags = input_file.find_all()
word_list = []
name_list = []
# str.maketrans needs two strings of equal length: one space per character
translator = str.maketrans(":[];.,#&*\\/", " " * 11)
for tag in tags:
    try:
        name = tag.attrs['name']
    except:
        name = None
    words = tag.text.translate(translator)
    words = words.split(" ")
    for word in words:
        if words != '':
            word_list.append(word)
            name_list.append(name)
print(word_list)
print(name_list)
My output:
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', '', 'consectetur', 'adipiscing', 'elit', 'Morbi', 'sit', 'amet', 'malesuada', 'nisl', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', '', 'porta', 'interdum', 'odio', 'tempus', '\n', 'Lorem', 'ipsum', 'dolor', 'sit', 'amet', '', 'consectetur', 'adipiscing', 'elit', 'Morbi', 'sit', 'amet', 'malesuada', 'nisl', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', '', 'porta', 'interdum', 'odio', 'tempus', '\n', 'Lorem', 'ipsum', '', 'dolor', 'sit', 'dolor', 'sit', 'sit', 'Morbi', 'sit', 'amet', 'malesuada', 'nisl', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', '', 'porta', 'interdum', 'odio', 'tempus', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', 'rhoncus', 'diam']
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'datetime', 'datetime', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'address', 'address', 'address', 'address', 'address', 'address', 'address', None, None]
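The problems are:
A. Some of the text appears multiple times across the tags, and I don't know how to fix that.
B. Some words are empty (''), and they are still added to the list even though I check for that in the if block.
If anyone could give me some hints, that would be very helpful :)
You can extract the text like this: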
text = input_file.text.replace("\n" , " ")
words = text.split()
This produces:
'Lorem',
'ipsum',
'dolor',
'sit',
'amet,',
'consectetur',
'adipiscing',
'elit.',
'Morbi',
'sit',
'amet',
'malesuada',
'nisl.',
'Phasellus',
'rhoncus',
'diam',
'sit',
'amet',
'augue',
'dictum,',
'porta',
'interdum',
'odio',
'tempus.'
For the other list, you can try:
tags = input_file.find_all("span")
for s in tags:
    if "name" in s.attrs:
        print(s["name"])
This produces:
datetime
address
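On point B of the question: the check `if words != ''` compares the whole list `words` with an empty string, which is always true, so nothing is filtered out. Comparing the individual `word` fixes it. Below is a minimal sketch of the difference; `split_words` and its `buggy` flag are hypothetical names used only for this illustration.

```python
def split_words(text, buggy=False):
    """Split on single spaces and drop empty strings (unless buggy=True)."""
    words = text.split(" ")
    kept = []
    for word in words:
        if buggy:
            # Mirrors the question: a list is never equal to '', so this is always True.
            keep = words != ''
        else:
            # Intended check: test the individual word.
            keep = word != ''
        if keep:
            kept.append(word)
    return kept


print(split_words("amet  consectetur", buggy=True))  # ['amet', '', 'consectetur']
print(split_words("amet  consectetur"))              # ['amet', 'consectetur']
```

Calling `text.split()` with no argument is another option, since it splits on any run of whitespace and never returns empty strings.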
Ah, I found a solution, sorry for wasting your time! I had been trying for hours without success, but now I have it working. In case anyone is interested:
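from bs4 import BeautifulSoup

input_file = BeautifulSoup(open("ex2.html", 'r'), 'lxml')
tags = input_file.contents[0]  # for this document, the root <html> element
word_list = []
name_list = []
# str.maketrans needs two strings of equal length: one space per character
translator = str.maketrans(":[];.,#&*\\/", " " * 11)

def recurse(tags, name):
    for tag in tags:
        try:
            this_name = tag.attrs['name']
        except (AttributeError, KeyError):
            # plain strings have no attrs, and a tag may lack a name attribute
            this_name = name
        if tag.string is None:
            # several children: descend and pass the inherited name along
            recurse(tag, this_name)
        else:
            words = tag.string.translate(translator)
            words = words.split(" ")
            for word in words:
                if word != '':
                    word_list.append(word)
                    name_list.append(this_name)

recurse(tags, None)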
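Thanks for your reply! The problem is that I need the name for every word. Some words are not inside a span and therefore have no name, while others have a span name, so the second list must be the same length as the first.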
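The duplicates in the question arise because `find_all()` returns every tag and each tag's `.text` includes the text of all its descendants, so nested text is visited once per enclosing tag. One way to visit each piece of text exactly once is to iterate over the text nodes and look up the nearest ancestor that carries a `name` attribute. The following is only a sketch under a few assumptions: it uses the built-in `html.parser` instead of `lxml`, inlines the sample HTML instead of reading `ex2.html`, and `words_with_names` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Sample document from the question, inlined so the sketch is self-contained.
HTML = (
    '<html><body><h2>Lorem ipsum <span name="datetime" class="0">dolor <strong>'
    ' sit</strong></span> amet, consectetur adipiscing elit.</h2>'
    '<p>Morbi sit amet malesuada nisl. <span name="address" class="1">Phasellus '
    '<strong>rhoncus diam</strong> sit amet augue dictum</span>, '
    'porta interdum odio tempus.</p></body></html>'
)

# Map each punctuation character from the question to a space.
translator = str.maketrans(dict.fromkeys(":[];.,#&*\\/", " "))


def words_with_names(markup):
    """Return parallel lists: each word and the name of its enclosing tag (or None)."""
    soup = BeautifulSoup(markup, "html.parser")
    word_list, name_list = [], []
    # Each text node is visited once, so no word is repeated.
    for text in soup.find_all(string=True):
        # Nearest ancestor that has a name attribute, if any.
        owner = text.find_parent(attrs={"name": True})
        name = owner["name"] if owner is not None else None
        # split() without arguments never yields empty strings.
        for word in text.translate(translator).split():
            word_list.append(word)
            name_list.append(name)
    return word_list, name_list


words, names = words_with_names(HTML)
print(list(zip(words, names)))
```

Both lists have the same length by construction, because a name is appended for every word that is appended.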