How do I find a specific word in an HTML page using Beautiful Soup in Python?
Tags: python, python-2.7, beautifulsoup

I want to use BeautifulSoup to find out how many times a specific word occurs in the text of a web page. I tried the findAll function, but I only looked for the word inside one specific tag (such as soup.body). findAll searches for the word in the body tag, whereas I want it to search for the word in every tag of the HTML text.
Also, once I have found the word, I need to build a list of the words that come before and after it. Can anyone help me with how to do this? Thanks.

You can use the recursive keyword to find text in the whole tree. You get the matching strings back, and you can then operate on them and split them into words.
Here is a complete example:
import re

import bs4

data = '''
<html>
<body>
<div>today is a sunny day</div>
<div>I love when it's sunny outside</div>
Call me sunny
<div>sunny is a cool word sunny</div>
</body>
</html>
'''

searched_word = 'sunny'

soup = bs4.BeautifulSoup(data, 'html.parser')
# Collect every text node under <body> that contains the searched word
results = soup.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)

# Note: len(results) is the number of text nodes that contain the word,
# not the total number of occurrences
print('Found the word "{0}" {1} times\n'.format(searched_word, len(results)))

for content in results:
    words = content.split()
    for index, word in enumerate(words):
        # If the content contains the search word twice or more, this fires for each occurrence
        if word == searched_word:
            print('Whole content: "{0}"'.format(content))
            before = None
            after = None
            # Check whether it is the first word
            if index != 0:
                before = words[index - 1]
            # Check whether it is the last word
            if index != len(words) - 1:
                after = words[index + 1]
            print('\tWord before: "{0}", word after: "{1}"'.format(before, after))
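Possible duplicate.
No, it's not a duplicate, I checked.
results = soup.body.find_all(string=searched_word, recursive=true) gives NameError: name 'true' is not defined. I have downloaded version 4.3.
I updated the answer with a complete working example, please check again.
I get 'Found the word "sunny" 0 times'.
Are you using Python 2.7.3? I just copy-pasted the example code.
It seems the string keyword was added in version 4.4, so either use that version or change soup.body.find_all(string=…) to soup.body.find_all(text=…) (the keyword is different in 4.3 and earlier).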
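The version note above can be turned into a small helper that picks the keyword name from the installed version. This is only a sketch: text_filter_keyword is a hypothetical name, not part of Beautiful Soup, and it assumes the version string starts with "major.minor" digits, as bs4.__version__ normally does.

```python
import re


def text_filter_keyword(bs4_version):
    """Pick the find_all() keyword used to match text nodes.

    Hypothetical helper: returns 'string' for Beautiful Soup 4.4 and later,
    and 'text' for 4.3 and earlier, following the comment above.
    """
    match = re.match(r'(\d+)\.(\d+)', bs4_version)
    if match is None:
        raise ValueError('unrecognised version: {0!r}'.format(bs4_version))
    major, minor = int(match.group(1)), int(match.group(2))
    return 'string' if (major, minor) >= (4, 4) else 'text'


# Usage idea (needs bs4 installed, so it is left as a comment here):
#   keyword = text_filter_keyword(bs4.__version__)
#   results = soup.body.find_all(**{keyword: re.compile(searched_word)})
print(text_filter_keyword('4.3.2'))  # text
print(text_filter_keyword('4.4.0'))  # string
```

Passing the keyword through ** keeps the rest of the example unchanged on either version.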
Output of the example:
Found the word "sunny" 4 times

Whole content: "today is a sunny day"
	Word before: "a", word after: "day"
Whole content: "I love when it's sunny outside"
	Word before: "it's", word after: "outside"
Whole content: "
Call me sunny
"
	Word before: "me", word after: "None"
Whole content: "sunny is a cool word sunny"
	Word before: "None", word after: "is"
Whole content: "sunny is a cool word sunny"
	Word before: "word", word after: "None"
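Note that the "4 times" in this output is the number of text nodes containing the word. The word itself occurs five times, because the last node contains it twice. A minimal sketch of counting occurrences and collecting the before/after words as a list follows. word_contexts is a hypothetical helper, and the strings are assumed to have been extracted from the page already, for example with soup.stripped_strings; here they are hard-coded so that the block runs without bs4.

```python
def word_contexts(strings, word):
    """Return a list of (before, after) pairs, one per occurrence of `word`.

    `strings` is any iterable of text fragments (assumed to be one per
    text node). Neighbours are looked up within a fragment only, so they
    never cross from one tag into the next.
    """
    contexts = []
    for fragment in strings:
        words = fragment.split()
        for index, current in enumerate(words):
            if current != word:
                continue
            before = words[index - 1] if index > 0 else None
            after = words[index + 1] if index < len(words) - 1 else None
            contexts.append((before, after))
    return contexts


# Hard-coded stand-in for the text nodes of the example page (assumption)
page_strings = [
    'today is a sunny day',
    "I love when it's sunny outside",
    'Call me sunny',
    'sunny is a cool word sunny',
]

contexts = word_contexts(page_strings, 'sunny')
print('Found the word "sunny" {0} times'.format(len(contexts)))  # 5 times
for before, after in contexts:
    print('before: {0!r}, after: {1!r}'.format(before, after))
```

Like the answer's code, this compares whole whitespace-separated words, so "sunny," with trailing punctuation would not be counted.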