Python: counting phrase frequency in an HTML file


python html


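I'm currently trying to get comfortable with Python and have recently hit a roadblock in my code. I can't get a piece of code to run that counts how many times a phrase appears in an HTML file. I recently got some help building code that counts frequencies in a text file, but I'd like to know whether there is a way to do this directly from the HTML file (bypassing the copy-and-paste workaround). Any suggestions would be much appreciated. The code I used previously is shown below: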

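#!/usr/bin/env python3
# Written for Python 3.3.2 ("yield from" needs Python 3.3 or later).
import collections
import re
from itertools import tee

# Generator that yields every lower-cased word in the file, one at a time.
def findWords(filepath):
  with open(filepath) as infile:
    for line in infile:
      words = re.findall(r'\w+', line.lower())
      yield from words

phcnt = collections.Counter()

phrases = {'central bank', 'high inflation'}
# Two independent iterators over the same word stream; the second is advanced
# by one word so that zip() pairs each word with its successor.
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)  # the default avoids StopIteration on an empty file
for w1, w2 in zip(fw1, fw2):
  phrase = ' '.join([w1, w2])
  if phrase in phrases:
    phcnt[phrase] += 1

print(phcnt)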

You could use some_str.count(some_phrase).
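A minimal sketch of the str.count idea, for illustration only. The helper name count_phrases and the sample text are made up here and are not from the question. Two caveats: str.count matches substrings, so 'central bank' would also be counted inside 'central banks' or 'central banking', and a phrase split across a line break is missed unless whitespace is normalised first, which the sketch does.

```python
def count_phrases(text, phrases):
    """Count non-overlapping, case-insensitive occurrences of each phrase."""
    # Collapse every run of whitespace (including line breaks) to one space,
    # so a phrase that was wrapped across two lines is still found.
    normalised = " ".join(text.lower().split())
    return {phrase: normalised.count(phrase.lower()) for phrase in phrases}


# Hypothetical sample text; in practice this would be the file's contents.
sample = "The Central Bank fears high inflation.\nThe central\nbank acts."
counts = count_phrases(sample, ["central bank", "high inflation"])
print(counts)
```

If whole-word matching matters, the word-pair approach from the question is the safer choice.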

How about stripping the HTML tags before running the analysis? html2text does this job well.

import html2text
# infile is the HTML file, already opened for reading
content = html2text.html2text(infile.read())

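This gives you the text content (formatted in a particular way, but I don't think that is a problem for your approach). There are also options to ignore images and links, which you can use like this: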

h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
content = h.handle(infile.read())
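Once the tags are stripped, the two-word phrase counting from the question can run on the resulting string instead of on a file. The following is a rough sketch under that assumption. The function name count_phrases_in_text and the sample string are hypothetical; in real use the content string produced by html2text would be passed in instead.

```python
import collections
import re


def count_phrases_in_text(content, phrases):
    """Count two-word phrases in a string of already tag-free text."""
    # \w+ skips punctuation and Markdown-style markers such as '**',
    # and line breaks between words do not matter.
    words = re.findall(r"\w+", content.lower())
    counts = collections.Counter()
    # Pair every word with its successor and keep the pairs of interest.
    for w1, w2 in zip(words, words[1:]):
        phrase = f"{w1} {w2}"
        if phrase in phrases:
            counts[phrase] += 1
    return counts


# Hypothetical stand-in for the text that the conversion step returns.
sample_content = "The **central bank** said high\ninflation worries the central bank.\n"
result = count_phrases_in_text(sample_content, {"central bank", "high inflation"})
print(result)
```

Because the words are tokenised across the whole string rather than line by line, a phrase that straddles a line break is still counted.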

You can use collections.Counter.

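@Ashish Nitin Patil: Unfortunately, that only gives me a way to count words, not phrases. The original code I posted works on text files, but what I want to know is how to use it directly on an HTML file.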
