Python: list of all element names in an HTML document (Beautiful Soup)

python web-scraping

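I want to get a list of all the distinct tag names in an HTML document (a list of tag-name strings with no duplicates). I tried calling soup.find_all() with no arguments, but that gave me the entire document.

Is there a way to do this?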

With soup.find_all() you get a list of every element, which you can iterate over. So you can do the following:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""  # an html sample
soup = BeautifulSoup(html_doc, 'html.parser')

document = soup.html.find_all()

el = ['html',]  # we already include the html tag
for n in document:
    if n.name not in el:
        el.append(n.name)

print(el)
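The same order-preserving result can also be sketched without the explicit loop. This is only an illustration, not part of the original answer: it assumes Python 3.7+ (where dict keeps insertion order), and the shortened html_doc and the name tag_names are made up for the example. Because the search starts at the soup root rather than at soup.html, the html tag is picked up without seeding the list by hand.

```python
from bs4 import BeautifulSoup

# Shortened sample document (hypothetical, for illustration only)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a></p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all(True) matches every tag; searching from the soup root
# includes <html> itself, unlike soup.html.find_all().
# dict.fromkeys drops duplicates while keeping first-seen order.
tag_names = list(dict.fromkeys(tag.name for tag in soup.find_all(True)))

print(tag_names)  # ['html', 'head', 'title', 'body', 'p', 'b', 'a']
```

This avoids the repeated "not in" scan over a list, which matters mainly for large documents with many distinct tag names.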

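Edit: As @pm2ring pointed out, if you don't care about the order in which the elements are added (and, as he says, I don't think you do here), you can use a set. In Python 3.x you don't have to import anything for it, but if you are on an older version you may want to check that the set comprehension syntax is supported.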

from bs4 import BeautifulSoup

...

el = {x.name for x in document} # use a set comprehension to generate it easily
el.add("html")  # only if you need to
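If a list is needed in the end, the set can be converted afterwards. The following is a minimal sketch under a few assumptions: the one-line html_doc and the names unique_names and sorted_names are hypothetical, and sorting alphabetically is just one way to get a stable order, since a set has none of its own.

```python
from bs4 import BeautifulSoup

# Tiny sample document (hypothetical, for illustration only)
html_doc = (
    "<html><head><title>T</title></head>"
    "<body><p><b>x</b> <a href='#'>y</a></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Searching from the soup root already includes <html>,
# so no separate add("html") step is needed here.
unique_names = {tag.name for tag in soup.find_all(True)}

# A set is unordered; sorted() returns a list in alphabetical order.
sorted_names = sorted(unique_names)

print(sorted_names)  # ['a', 'b', 'body', 'head', 'html', 'p', 'title']
```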


Simple and clear. Good idea. I suppose it works for the head tags too, doesn't it? By the way, you're missing the html tag.

@wonderwhy Edited; it is included now as well. Yes, it iterates over those tags too. The output for this is: ['html', 'head', 'title', 'body', 'p', 'b', 'a']

Using a set instead of a list for el is more efficient, because you don't need to bother with the "not in" test. Of course, a set doesn't preserve order, but that probably isn't a problem. And if the OP really needs a list, it's easy to convert the set to a list at the end.

@firelite Put the URL in string quotes first.