Python: How to get text from HTML while ignoring formatting tags using BeautifulSoup?


python html python-3.x


The code below is used to get continuous segments of text from HTML:

    from bs4 import Comment  # needed for the isinstance() check below

    for text in soup.find_all_next(text=True):
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

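The text items are broken up both by structural tags (such as `<br>`) and by formatting tags (such as `<em>`). This makes further parsing of the text awkward; I would like to get continuous text items while ignoring any formatting tags inside the text.

For example, I would like `soup.find_all_next(text=True)` to take the HTML for "This is important text", in which the word "important" is wrapped in a formatting tag, and return the single string `This is important text` rather than the three strings `This is`, `important` and `text`.

I'm not sure whether this is clear... if it isn't, please let me know.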
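To make the behaviour concrete, here is a minimal sketch. The `<div>` and `<em>` tags are assumptions chosen for illustration, since the exact markup of the example is not shown above.

```python
from bs4 import BeautifulSoup

# Assumed markup: <div> and <em> are stand-ins for the tags in the example.
html = "<div>This is <em>important</em> text</div>"
soup = BeautifulSoup(html, "html.parser")

# Every text node is returned separately, so the <em> tag splits the sentence.
pieces = soup.find_all(string=True)
print(pieces)  # ['This is ', 'important', ' text']

# What is wanted instead is the single string the pieces add up to.
print("".join(pieces))  # This is important text
```

`string=True` is the current spelling of the `text=True` argument used in the question.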
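Edit: The reason I go through the HTML one text item at a time is that I only start looking after I see a specific "begin" comment tag, and I stop when I reach a specific "end" comment tag. Is there a solution that still works when I need to walk through the document item by item? Here is the complete code I am using: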

    soup = BeautifulSoup(page)
    for instanceBegin in soup.find_all(text=isBeginText):
        # We found a start comment, look at all text and comments:
        for text in instanceBegin.find_all_next(text=True):
            # We found a text or comment, examine it closely
            if isEndText(text):
                # We found the end comment, everybody out of the pool
                break
            if isinstance(text, Comment):
                # We found a comment, ignore
                continue
            if not text.strip():
                # We found a blank text, ignore
                continue
            # Whatever is left must be good
            print(text)

Here `isBeginText(text)` and `isEndText(text)` are two functions that return True if the string passed to them matches my begin or end comment tag.
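The two helper functions are not shown in the question. The sketch below is one possible shape for them, so that the loop can be run end to end. The marker strings `InstanceBegin` and `InstanceEnd`, the sample `page`, and the function bodies are all hypothetical; only the function names come from the question.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical marker strings; the question does not show the real ones.
BEGIN_MARKER = "InstanceBegin"
END_MARKER = "InstanceEnd"


def isBeginText(text):
    """Return True for a comment whose body contains the begin marker."""
    return isinstance(text, Comment) and BEGIN_MARKER in text


def isEndText(text):
    """Return True for a comment whose body contains the end marker."""
    return isinstance(text, Comment) and END_MARKER in text


# Hypothetical page used only to exercise the loop.
page = """
<p>header</p>
<!-- InstanceBegin -->
<p>This is <em>important</em> text</p>
<!-- a comment to skip -->
<!-- InstanceEnd -->
<p>footer</p>
"""

soup = BeautifulSoup(page, "html.parser")
found = []
for instanceBegin in soup.find_all(string=isBeginText):
    for text in instanceBegin.find_all_next(string=True):
        if isEndText(text):
            break  # reached the end marker
        if isinstance(text, Comment) or not text.strip():
            continue  # skip other comments and blank strings
        found.append(str(text))

print(found)  # ['This is ', 'important', ' text']
```

The header and footer are left out, but the sentence inside the markers still arrives in three pieces, which is the problem the question describes.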
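If you grab a parent element that contains child elements and call `get_text()` on it, BeautifulSoup strips out all the HTML tags for you and returns only a continuous string of text. You can find an example here:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.get_text())

How about using `find_all_next` twice, once for the start marker and once for the end marker, and taking the difference of the two resulting lists?

As an example, I'll use a modified version of `html_doc` from the BeautifulSoup documentation:

    import bs4

    # Assumption: the <p>/<b> tags and the two marker comments were lost from
    # this page; they are reconstructed here, with the markers placed so that
    # the output listed at the bottom is produced.
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p><!-- END -->

    <p class="story">...</p>
    """

    soup = bs4.BeautifulSoup(html_doc, 'html.parser')
    comments = soup.findAll(text=lambda text: isinstance(text, bs4.Comment))

    # Step 1: find the beginning and ending markers
    node_start = [cmt for cmt in comments if cmt.string == " START"][0]
    node_end = [cmt for cmt in comments if cmt.string == " END "][0]

    # Step 2: subtract the 2nd list of strings from the first
    all_text = node_start.find_all_next(text=True)
    all_after_text = node_end.find_all_next(text=True)

    subset = all_text[:-(len(all_after_text) + 1)]
    print(subset)
    # ['Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.']

I think I should note that all of this has to happen between two specific "start" and "end" comments. That is why I walk through the HTML item by item. If I use `get_text()` on the whole HTML file, I get a bunch of junk from the header and footer that I don't want. I've edited my original question to include this context.

How do you want to handle the case of two nested block-level tags? Say, markup whose text content is `AB C`. What would you want in that case? In any case, it seems to me that you should check whether the current tag has descendants. If it does, check (recursively) whether those descendants are of a "formatting" type (note that this is subjective: you consider `em` to be one of them, but not `br`), and if so, remove the formatting tag but keep its inner HTML. Maybe I haven't fully understood your question, but wouldn't that solve your problem?

Yes, I hear you. Really, apart from preserving basic sentence structure, I don't care about any of the formatting. I can ignore the individual tags as long as the sentences stay intact (i.e. words don't run together). I know about the `soup.get_text()` method, but I'm not sure how to apply it given my specific constraint about the start and end markers (see the edit to my original question).

@OliverW. Sure: the start marker is a comment tag, and the end marker is a comment tag as well. I want all the text between those two comment tags. If there are newlines or line breaks, I'm happy to replace them with spaces as long as complete sentences and words are preserved.

Give me a few minutes to try this.

Works great! Wow! I had to modify it a little to fit it into my code, but the output preserves not only the sentence structure but also the formatting. I could write it straight to a file if I wanted to. Thanks, man! The only problem is that it reads comments as well as text. Can comments be excluded from the search in the `find_all_next` method? Maybe I could use the `extract` method to remove the comments (other than the two I need) from the selected body of text? I'll give it a try.

That did it. After we create the list of comments with `soup.findAll`, I added the line `[cmt.extract() for cmt in comments if cmt.string != start and cmt.string != end]`. I learned something new tonight. You're a wizard, thanks for taking the time!

I do want to offer a possible alternative for future viewers of this question (one I had before Oliver W.'s excellent solution): Aaron Swartz's html2text. It does a nice job of reducing HTML to text in Markdown. It works well, but it wasn't the best solution for what I'm doing here. Just FYI.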
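Putting the suggestions from this thread together, the following is a minimal sketch rather than the poster's actual code. It removes every comment except the two markers with `extract()`, slices out the strings between the markers as in the answer, and then joins them with whitespace collapsed so that formatting tags no longer split a sentence. The sample document and its stray comment are made up for illustration; the marker strings `" START"` and `" END "` follow the answer's code.

```python
import bs4

# Hypothetical document for illustration only.
html_doc = """
<html><body>
<p>Header junk</p>
<!-- START--><p>This is <em>important</em> text,
<!-- a stray comment -->spread over <b>two</b> lines.</p><!-- END -->
<p>Footer junk</p>
</body></html>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")
comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))

# Locate the two marker comments.
node_start = next(c for c in comments if c == " START")
node_end = next(c for c in comments if c == " END ")

# Drop every other comment from the tree so it cannot show up in the text.
for cmt in comments:
    if cmt is not node_start and cmt is not node_end:
        cmt.extract()

# Strings after START, minus the END marker and everything after it.
all_text = node_start.find_all_next(string=True)
all_after_text = node_end.find_all_next(string=True)
subset = all_text[:-(len(all_after_text) + 1)]

# Concatenate the pieces, then collapse newlines and repeated spaces.
joined = " ".join("".join(subset).split())
print(joined)  # This is important text, spread over two lines.
```

Concatenating with an empty string before collapsing whitespace relies on the source text already containing the spaces between words; if adjacent elements carry no whitespace between them, a separator would have to be chosen instead.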