Python: Retrieving all words with a minimum frequency of 5 (Python, NLTK)


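I want to use NLTK to retrieve all words with a minimum frequency of 5 and store them in a variable for later processing. I could not find anything about this in the NLTK book. Thanks in advance.

Edit: I am using this code and want to filter out the words that occur fewer than 5 times.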

import os
import glob
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def create():
    read_files = glob.glob("D:\\test\\text\\*.txt")
    with open("D:\\test\\temp.txt", "wb") as outfile:
        for f in read_files:
            with open(f, "rb") as infile:
                outfile.write(infile.read())    

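def modify():
    # Raw string so that \w reaches the regex engine unescaped
    tokenizer = RegexpTokenizer(r"[\w']+")
    english_stops = set(stopwords.words('english'))
    # Context managers close both files even if an error occurs
    with open('D:\\test\\temp.txt') as f, open('D:\\test\\result.txt', 'w') as out:
        raw = f.read()
        a = tokenizer.tokenize(raw)
        # Lowercase before the stopword check so "The" is filtered like "the"
        a = [word.lower() for word in a if word.lower() not in english_stops]
        # Note: set() discards duplicates, so no frequency information survives here
        a = list(set(a))
        print(a, file=out)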

def remove():
    os.remove("D:\\test\\temp.txt")

if __name__ == '__main__':
    create()
    modify()
    remove()

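Use FreqDist to get the frequencies, then filter them according to your criterion. From its documentation:

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document.

Here is an example of how to use it: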

>>> import nltk
>>> from nltk import FreqDist
>>> sentence='''This is my sentence is heloo is heloo my my my my'''
>>> tokens = nltk.word_tokenize(sentence)
>>> fdist=FreqDist(tokens)
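
To see what such a distribution holds, here is a minimal sketch. It assumes NLTK 3, where FreqDist is a subclass of collections.Counter, and uses Counter with a plain whitespace split as stand-ins so that it runs without NLTK or its tokenizer data; the name `counts` is only illustrative.

```python
from collections import Counter

sentence = "This is my sentence is heloo is heloo my my my my"

# Stand-in for nltk.word_tokenize: split on whitespace only
tokens = sentence.split()

# Stand-in for FreqDist(tokens): maps each word to its number of occurrences
counts = Counter(tokens)

print(counts["my"])            # number of times "my" occurs
print(counts.most_common(2))   # the two most frequent words with their counts
```

With a real FreqDist, the same indexing and most_common calls should apply, since they come from Counter.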

This gives us the words together with their frequencies. Now filter them by your condition, f(w) >= 5, using the filter function:

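filter(function, iterable)

Construct an iterator from those elements of iterable for which function returns true. iterable may be either a sequence, a container which supports iteration, or an iterator.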
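
Putting the two pieces together, the following is a rough sketch of the whole step, not the only way to do it. It counts before any deduplication and then keeps the words whose count is at least 5. To stay runnable without NLTK it uses collections.Counter in place of FreqDist and re.findall in place of RegexpTokenizer; the function name `frequent_words`, its parameters, and the tiny stop-word set are hypothetical.

```python
import re
from collections import Counter

def frequent_words(text, stop_words, min_freq=5):
    """Return the words in text that occur at least min_freq times."""
    # Same pattern as in the question, applied with the standard library
    tokens = [w.lower() for w in re.findall(r"[\w']+", text)]
    # Count every occurrence; calling set() first would lose the frequencies
    counts = Counter(w for w in tokens if w not in stop_words)
    # Iterating a Counter yields its keys, so filter() tests each distinct word
    return list(filter(lambda w: counts[w] >= min_freq, counts))

sample = "This is my sentence is heloo is heloo my my my my"
result = frequent_words(sample, stop_words={"this", "is"})
print(result)
```

A list comprehension such as [w for w, c in counts.items() if c >= min_freq] does the same job and may read more naturally; with NLTK installed, FreqDist(tokens) can be dropped in where Counter is used here.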