Python: Grabbing text between tags with BeautifulSoup


python html


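I'm trying to use BeautifulSoup to grab every individual piece of text between each tag (from my list) in a .txt file and store the words in a dictionary. The code works, but it is very slow on large files. Is there another way to make this code faster?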

from bs4 import BeautifulSoup

words_dict = dict()

# these are all of the tags in the file I'm looking for
tags_list = ['title', 'h1', 'h2', 'h3', 'b', 'strong']

def grab_file_content(file : str):
    with open(file, encoding = "utf-8") as file_object:
        # entire content of the file with tags
        content = BeautifulSoup(file_object, 'html.parser')

        # if the content has content within the <body> tags...
        if content.body:
            for tag in tags_list:
                for tags in content.find_all(tag):
                    text_list = tags.get_text().strip().split(" ")
                    for words in text_list:
                        if words in words_dict:
                            words_dict[words] += 1
                        else:
                            words_dict[words] = 1

        else:
            print('no body')
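
The counting loop in the code above (check whether the word is in the dictionary, then increment or initialise) can be expressed with `collections.Counter` from the standard library. The following is a minimal sketch of that idea only; the function name `count_words` and the sample strings are made up for illustration, and it does not touch the parsing step.

```python
from collections import Counter


def count_words(text_fragments):
    """Count whitespace-separated words across an iterable of strings."""
    counts = Counter()
    for fragment in text_fragments:
        # split() with no argument collapses runs of whitespace and never
        # yields empty strings, unlike split(" ")
        counts.update(fragment.split())
    return counts


# Made-up input standing in for the text extracted from each tag
sample_fragments = ["My  Text", " My Title "]
word_counts = count_words(sample_fragments)
print(word_counts)
```

One side effect of using `split()` instead of `split(" ")` is that empty strings caused by repeated spaces are no longer counted as words.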


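The following code is roughly equivalent to yours, with one caveat: tag.string is None for a tag that contains nested tags, so such tags are skipped, whereas get_text() includes their text: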

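from collections import Counter
from itertools import chain

from bs4 import BeautifulSoup

# Same tags as in the question
tags_list = ['title', 'h1', 'h2', 'h3', 'b', 'strong']

words_dict = Counter()  # an empty counter, used as an accumulator

# Probably inside a loop over the files.
# Create the soup here, as in your original code (file_object is the open file).
content = BeautifulSoup(file_object, 'html.parser')
# find_all() accepts a list of tag names, so a single search covers all tags
words_dict.update(chain.from_iterable(tag.string.split()
                  for tag in content.find_all(tags_list) if tag.string))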
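
To see how the `tag.string` approach in the answer compares with the `get_text()` call in the question, here is a small self-contained sketch. The function names and the sample markup are made up for illustration; both functions search for all listed tags in one `find_all()` call.

```python
from collections import Counter

from bs4 import BeautifulSoup

TAGS = ['title', 'h1', 'h2', 'h3', 'b', 'strong']  # tag list from the question


def count_with_string(markup):
    """Count words using tag.string, as in the answer."""
    soup = BeautifulSoup(markup, 'html.parser')
    counts = Counter()
    for tag in soup.find_all(TAGS):
        # tag.string is None when the tag has more than one child
        if tag.string:
            counts.update(tag.string.split())
    return counts


def count_with_get_text(markup):
    """Count words using tag.get_text(), as in the question."""
    soup = BeautifulSoup(markup, 'html.parser')
    counts = Counter()
    for tag in soup.find_all(TAGS):
        # get_text() joins the text of the tag and all of its descendants
        counts.update(tag.get_text().split())
    return counts


# Made-up sample: the <h1> contains both plain text and a nested <b>
SAMPLE = ("<html><head><title>My Title</title></head>"
          "<body><h1>My <b>Text</b></h1></body></html>")

string_counts = count_with_string(SAMPLE)
get_text_counts = count_with_get_text(SAMPLE)
print(string_counts)
print(get_text_counts)
```

In this sample the `<h1>` is skipped by the `tag.string` version because it has two children, so the two counters differ. Which behaviour is wanted depends on whether words in nested tags should be counted once per enclosing tag.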

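You say you want the text between tags (say, between the end of one tag and the start of another), but in your example you extract the words inside the tags (that is, between a tag's opening and its closing). Which do you want?

Ah, yes, I want the content between a tag's opening and its closing. For example, for a tag wrapping My Text, I want my dictionary to store {My: 1, Text: 1}. Thank you.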
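
Since only the words inside the listed tags are needed, one option that may be worth trying on large files is Beautiful Soup's `SoupStrainer`, which tells the parser to build a tree containing only the matching tags. This is a sketch, not a measured fix: the function name and sample markup are made up, and any speed-up should be checked by timing it on the real files.

```python
from collections import Counter

from bs4 import BeautifulSoup, SoupStrainer

TAGS = ['title', 'h1', 'h2', 'h3', 'b', 'strong']  # tag list from the question


def count_words_in_tags(markup, tags=TAGS):
    """Parse only the listed tags and count the words inside them."""
    only_wanted = SoupStrainer(tags)
    # parse_only keeps just the matching tags (and their contents) in the tree
    soup = BeautifulSoup(markup, 'html.parser', parse_only=only_wanted)
    counts = Counter()
    for tag in soup.find_all(tags):
        counts.update(tag.get_text().split())
    return counts


# Made-up sample: the <p> is not in TAGS, so it is left out of the tree
SAMPLE = "<html><body><p>skipped paragraph</p><h1>My Text</h1></body></html>"
counts = count_words_in_tags(SAMPLE)
print(counts)
```

Because `<body>` itself is filtered out of the tree, the `content.body` check from the question would no longer apply with this approach.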