Python BeautifulSoup: parsing HTML file content

python html parsing web-scraping

I have 30,911 HTML files in a folder. I need to (1) check whether each file contains the tag:

<strong>123</strong>

By the way, is it possible to save the content as a .txt file while keeping the line layout it has in the HTML?

line 1
line 2
...
line 50

If I use p.get_text(strip=True), all the content runs together:

line1 content line2 content ... 
line50 content....
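
One possible way to keep the lines apart is the separator argument of get_text. The sketch below is only an illustration: the sample markup is hypothetical and assumes the lines inside each p are separated by br tags, which the question does not show.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <p> whose lines are separated by <br> tags.
html = "<p>line 1<br>line 2<br>line 3</p>"

p = BeautifulSoup(html, "html.parser").p

# strip=True with the default (empty) separator joins every text node.
joined = p.get_text(strip=True)

# Passing a separator puts each text node on its own line.
separated = p.get_text(separator="\n", strip=True)

print(joined)
print(separated)
```

Writing the separated string to a .txt file would then give one line per text node.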

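If I understand correctly, you can first find the starting point: a p element that has a strong element with the "Question-and-Answer Session" text. Then you can iterate over the p elements that follow it until you reach one whose strong element contains the "Copyright policy" text.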

Complete reproducible example:

import re

from bs4 import BeautifulSoup


data = """
<body>
    <p class="p p4" id="question-answer-session">
      <strong>
       Question-and-Answer Session
      </strong>
    </p>

    <p class="p p4">
       Hi John and Greg, good afternoon. contents....
    </p>

    <p class="p p14">
      <strong>
       Copyright policy:
      </strong>
      other content about the policy....
    </p>
</body>
"""

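soup = BeautifulSoup(data, "html.parser")

def find_question_answer(tag):
    # Match a <p> that contains a <strong> holding the section title.
    # string= is the current name of the older text= argument (bs4 4.4.0+).
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Question-and-Answer Session"))

# Starting point: the paragraph that opens the Q&A section.
question_answer = soup.find(find_question_answer)

# Walk the following <p> siblings and stop at the copyright paragraph.
for p in question_answer.find_next_siblings("p"):
    if p.find("strong", string=re.compile(r"Copyright policy")):
        break

    print(p.get_text(strip=True))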
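
Since the question is about a whole folder of files, the same idea could be wrapped in a function and applied to each file. The following is a minimal sketch, not a tested solution for the real data: the names extract_qa and convert_folder, the "*.html" pattern, and the UTF-8 encoding are all assumptions made for illustration.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

START = re.compile(r"Question-and-Answer Session")
END = re.compile(r"Copyright policy")


def extract_qa(html):
    """Return the text between the two markers, or None if the start marker is absent."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(lambda tag: tag.name == "p" and tag.find("strong", string=START))
    if start is None:
        return None
    lines = []
    for p in start.find_next_siblings("p"):
        if p.find("strong", string=END):
            break
        lines.append(p.get_text(strip=True))
    # One paragraph per line, so the .txt output keeps a line layout.
    return "\n".join(lines)


def convert_folder(src_dir, dst_dir):
    """Write one .txt per .html file that has a Q&A section; return how many were written."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(Path(src_dir).glob("*.html")):
        text = extract_qa(path.read_text(encoding="utf-8", errors="ignore"))
        if text is None:
            continue  # this file has no Q&A section
        (dst / (path.stem + ".txt")).write_text(text, encoding="utf-8")
        written += 1
    return written
```

Returning None for files without the start marker covers the "check whether it contains the tag" part of the question, and joining paragraphs with a newline keeps them on separate lines in the output file.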

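If I write the content to a new HTML file, the formatting gets messed up.

@MichaelLin OK, which part do you want to write to the file?

I think I can solve it: I use p.prettify().encode('ascii', 'ignore').decode('utf-8', 'ignore'), and it then saves only the content before the copyright notice. But as I mentioned in the question, there is another marker, "related:", so the end could be either "Copyright" or "Related". Is there any way to handle that?

@MichaelLin One option is to adjust the regular expression: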

re.compile(r"(Copyright policy|related)")
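
A small sketch of how such an alternation could be used to stop at whichever marker comes first. The sample markup is hypothetical, and the exact spelling of the second marker ("Related:" inside a strong element) is an assumption, which is why the pattern is compiled case-insensitively here.

```python
import re

from bs4 import BeautifulSoup

# Assumption: the second end marker appears as "Related:" inside a <strong>.
END = re.compile(r"Copyright policy|Related", re.IGNORECASE)

# Hypothetical sample in which "Related:" comes before the copyright paragraph.
data = """
<p><strong>Question-and-Answer Session</strong></p>
<p>first answer</p>
<p>second answer</p>
<p><strong>Related:</strong> links</p>
<p><strong>Copyright policy:</strong> text</p>
"""

soup = BeautifulSoup(data, "html.parser")
title = soup.find("strong", string=re.compile(r"Question-and-Answer Session"))
start = title.find_parent("p")

kept = []
for p in start.find_next_siblings("p"):
    # Stop at the first paragraph whose <strong> matches either marker.
    if p.find("strong", string=END):
        break
    kept.append(p.get_text(strip=True))

print(kept)
```

Note that a loose pattern like this also matches any strong text that merely contains "related", so a stricter pattern may be needed on the real files.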

Running the complete example above prints:

Hi John and Greg, good afternoon. contents....