Print the 10 least frequent words in a text document using Python (Python, Python 2.6, defaultdict)


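I have a small Python script that prints the 10 most frequent words in a text document (each word having 2 or more letters), and I need to extend it to also print the 10 least frequent words in the document. My script mostly works, except that the 10 least frequent words it prints are numbers (integers and floats) when they should be words. How can I iterate over only the words and exclude the numbers? Here is my full script: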

# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
    if len(word) >= 2:
        counter[word] += 1

top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

Edit: the end of the script (the part under the "Least Frequent Words" comment) is the part that needs fixing.

You need a function, letters_only(), that runs a regular expression matching [0-9] and returns False if it finds any match. Something like this:

import re

def letters_only(word):
    # True only when the word contains no digit at all.
    return re.search(r'[0-9]', word) is None

Then, instead of

for word in words

write

for word in filter(letters_only, words)
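Putting the filter into the counting loop might look like the sketch below. It is written for Python 3 as an assumption (the question targets Python 2.6, where counter.iteritems() and the print statement would be used instead), and the least_frequent helper and the sample text are hypothetical stand-ins for the script and charactermask.txt.

```python
import re
from collections import defaultdict
from string import punctuation


def letters_only(word):
    """Return True when the word contains no digit characters."""
    return re.search(r'[0-9]', word) is None


def least_frequent(text, number=10):
    """Hypothetical helper: count digit-free words of 2+ characters, return the rarest."""
    words = [w.strip(punctuation).lower() for w in text.split()]
    counter = defaultdict(int)
    for word in filter(letters_only, words):
        if len(word) >= 2:
            counter[word] += 1
    # Sort by ascending count, then alphabetically to break ties.
    return sorted(counter.items(), key=lambda item: (item[1], item[0]))[:number]


# Hypothetical sample text standing in for charactermask.txt.
sample = "the cat and the dog and the 42 birds saw 3.14 cats"
for word, frequency in least_frequent(sample, number=3):
    print("%s: %d" % (word, frequency))
```

Because the numbers are dropped before counting, tokens such as 42 and 3.14 never enter the table at all.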
You are going to need a filter; change the regular expression to match whatever you want to define as a "word":
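The pattern itself does not appear in the text here, so the following is only a guess at what alphaonly could look like: an assumed definition that accepts two or more lowercase ASCII letters and nothing else. The token list is made up for illustration.

```python
import re

# Assumed definition (not taken from the original answer):
# two or more lowercase ASCII letters and nothing else.
alphaonly = re.compile(r"^[a-z]{2,}$")

# Tokens are expected to be lowercased and stripped of punctuation already.
for token in ["word", "a", "42", "3.14", "mp3", "don't"]:
    print(token, bool(alphaonly.match(token)))
```

Under this assumption, mixed tokens such as mp3 and contractions such as don't are rejected as well; loosening the character class is the place to change that.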
Now, do you want the word frequency table to exclude numbers in the first place?
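from collections import defaultdict
from string import punctuation

# alphaonly is assumed to be a precompiled regex that defines a "word".
counter = defaultdict(int)

with open("charactermask.txt") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1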

Or do you just want to skip over numbers when extracting the least frequent words from the table?
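import sys

# Sort by ascending count, then alphabetically to break ties.
words_by_freq = sorted(counter.iteritems(),
                       key=lambda (word, count): (count, word))

i = 0
for word, frequency in words_by_freq:
    if alphaonly.match(word):
        i += 1
        # Explicit field indices: Python 2.6 does not accept bare "{}".
        sys.stdout.write("{0}: {1}\n".format(word, frequency))
    if i == number:
        break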
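The manual counter i can also be replaced by itertools.islice over a filtered generator. This is a swapped-in alternative, not part of the answer above, sketched for Python 3 as an assumption; the alphaonly pattern, the least_frequent helper and the counts dictionary are all hypothetical.

```python
import re
from itertools import islice

# Assumed pattern: two or more lowercase ASCII letters.
alphaonly = re.compile(r"^[a-z]{2,}$")


def least_frequent(counter, number=10):
    """Hypothetical helper: the `number` rarest entries whose key looks like a word."""
    by_freq = sorted(counter.items(), key=lambda item: (item[1], item[0]))
    only_words = (pair for pair in by_freq if alphaonly.match(pair[0]))
    # islice stops after `number` matches, like the explicit break.
    return list(islice(only_words, number))


# Made-up counts standing in for the real frequency table.
counts = {"the": 3, "42": 1, "3.14": 1, "dog": 1, "cat": 1, "and": 2}
for word, frequency in least_frequent(counts, number=3):
    print("{0}: {1}".format(word, frequency))
```

The numbers stay in the table here and are only skipped at extraction time, which matches the second option described above.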

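Great! I have also changed my answer to the shorter form wim suggested; I thought the longer form was clearer, but maybe it is just a code tic I need to get rid of. :) A bit puzzled by the downvote, but so it goes.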