Select everything except a specific tag with Python BeautifulSoup

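I have more than 1,000 HTML files with differing formats, elements, and content. I need to walk each file recursively and select everything except the <h1> element.

Here is a sample file (note that this is the smallest and simplest of the files; the rest are much larger and more complex, with many different elements that follow no single template, other than starting with an <h1> element):

I expected this to select everything below the <h1> element, but it does not. Using soup.select("h1") selects only that one line, not everything below it. What should I do?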
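The behaviour described in the question can be reproduced with a small sketch. The sample markup is hypothetical (the original sample file is not shown), and it assumes that "everything below" means the siblings that follow the heading. select("h1") returns only the matching tags themselves, while find_next_siblings() returns the tags that come after one of them at the same level.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the file from the question.
html = "<h1>Title</h1><p>First</p><p>Second</p>"
soup = BeautifulSoup(html, "html.parser")

# select() returns only the elements that match the CSS selector.
matches = soup.select("h1")
print(matches)

# find_next_siblings() with no filter returns every tag that follows
# the heading at the same nesting level.
following = soup.h1.find_next_siblings()
print(following)
```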

Have you considered using .decompose() to remove the <h1>..</h1> element and then simply taking everything that remains?
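A minimal sketch of that suggestion, assuming a small inline document in place of the real files (the markup below is hypothetical). It removes every <h1> with decompose() and keeps whatever is left in the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real files are larger and more varied.
html = "<html><body><h1>Title</h1><p>First</p><div><p>Second</p></div></body></html>"

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like result, so removing tags while looping is safe.
for heading in soup.find_all("h1"):
    # decompose() removes the tag and its contents from the tree and destroys it.
    heading.decompose()

remaining = str(soup)
print(remaining)
```

After the loop, soup holds the document without any <h1> element, whatever other elements the file happens to contain.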

Use .extract() to remove the selected tag:

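from bs4 import BeautifulSoup

with open("file.htm") as ip:
    # Parse the HTML with the built-in "html.parser".
    soup = BeautifulSoup(ip, "html.parser")

# Detach the first <h1> from the tree, if there is one.
# Everything else stays in soup.
if soup.h1 is not None:
    soup.h1.extract()

print(soup)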
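Unlike .decompose(), .extract() returns the tag it removes, so the heading can still be used after it has been detached. The sketch below illustrates this with a hypothetical helper, split_heading, and inline sample markup. Neither comes from the original answer.

```python
from bs4 import BeautifulSoup


def split_heading(html):
    """Return (heading_text, remaining_markup) for one HTML document.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    heading_text = None
    if heading is not None:
        # extract() detaches the tag from the tree and returns it.
        heading_text = heading.extract().get_text(strip=True)
    return heading_text, str(soup)


title, rest = split_heading("<h1> Report </h1><p>Body text</p><ul><li>Item</li></ul>")
print(title)
print(rest)
```

The same function could be called once per file when looping over a directory of documents. A file without an <h1> comes back unchanged, with None as the heading text.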
