BS4: How to reduce find_all to a minimum (ignore instead of extract)


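I need to ignore comments and the Doctype in later operations, because I am going to replace some characters, and after that replacement I would no longer be able to tell comments and the Doctype apart from ordinary text.

Minimal example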

#!/usr/bin/env python3
import re
from bs4 import BeautifulSoup, Comment, Doctype


def is_toremove(element):
    return isinstance(element, Comment) or isinstance(element, Doctype)


def test1():
    html = \
    '''
    <!DOCTYPE html>
    word1 word2 word3 word4
    <!-- A comment -->
    '''
    soup = BeautifulSoup(html, features="html.parser")
    to_remove = soup.find_all(text=is_toremove)
    for element in to_remove:
        element.extract()

    # some operations needing soup.findAll
    for txt in soup.findAll(text=True):
        # some replace computations
        pass
    return soup
print(test1())

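The first find_all removes the unwanted nodes so that the second one only works on the parts that were not removed.

So my questions are:

Is there some internal magic that makes it possible to call findAll twice without becoming inefficient, or

how can I combine both into a single call?

I also tried using the parent tag:

if(not isinstance(txt, Doctype)

or

for example. This did not change anything in my main program.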

As stated in the comments, if you only want to get plain NavigableString objects, you can do the following:

from bs4 import BeautifulSoup, NavigableString


html = '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')

for visible_string in soup.find_all(text=is_string_only):
    print(visible_string)
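The exact type check in is_string_only matters because Comment and Doctype are subclasses of NavigableString, so an isinstance test would still match them. The following is a small sketch of that difference; the sample markup and the variable names are made up for illustration, and it uses html.parser instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString

# Illustrative markup, not taken from a real page.
html = "<!DOCTYPE html><p>word1</p><!-- A comment -->"
soup = BeautifulSoup(html, "html.parser")

# Comment and Doctype both derive from NavigableString.
print(issubclass(Comment, NavigableString), issubclass(Doctype, NavigableString))

# isinstance() therefore keeps comments and the doctype ...
all_strings = [n for n in soup.descendants if isinstance(n, NavigableString)]
kinds = [type(n).__name__ for n in all_strings]

# ... while an exact type comparison keeps only ordinary text nodes.
plain = [n for n in all_strings if type(n) is NavigableString]

print(kinds)
print([str(n) for n in plain])
```

This is why a filter based on isinstance(element, NavigableString) would not be enough on its own to skip comments and the Doctype.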


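If I understand correctly, you only want to select the strings (not comments, doctype, etc.)?

Correct. I only want to change the parts that are visible to a regular user who opens the website.

Sorry for the late accept; I was almost sure your answer was right, but had not had time to test it. Now I have, and I can accept your answer in good conscience.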


The answer's snippet prints:

word1 word2 word3 word4
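To connect this back to the question: with the is_string_only filter, the separate extract() pass may not be needed at all, because one find_all call can select the ordinary text nodes and the replacement can be applied to them directly. The sketch below is one possible way to do that, not necessarily what the asker ended up with. The function name replace_visible_text, the sample replacement and the comment text are hypothetical, and it assumes a Beautiful Soup version that accepts the string= argument (4.4.0 or later).

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup modelled on the question; the comment text is changed so
# that it also contains the search term.
HTML = """
<!DOCTYPE html>
word1 word2 word3 word4
<!-- word in a comment -->
"""


def is_string_only(node):
    # Exact type check: Comment and Doctype are NavigableString subclasses,
    # so they are rejected here.
    return type(node) is NavigableString


def replace_visible_text(markup, old, new):
    """Replace `old` with `new` in ordinary text nodes only (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list, so the tree can be modified while looping over it.
    for node in soup.find_all(string=is_string_only):
        if old in node:
            node.replace_with(node.replace(old, new))
    return soup


result = replace_visible_text(HTML, "word", "term")
print(result)
```

Comments and the Doctype stay in the tree untouched, so the document can still be serialized completely afterwards, which is the "ignore instead of extract" behaviour the title asks for.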