Python NLTK corpus preprocessing


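I am trying to remove the longer (>25 tokens) and the shorter entries from a corpus.

You can do it like this, keeping only the strings whose length is less than 26 and greater than 3: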

a = ["hello world", "how are you doing","where are you going?", "welcome to the greatest show on earth! How will you manage to gain all the experience needed for this to show?","hi"]
[len(w) for w in a]
>>>[11, 17, 20, 110,2]
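Method 1:
list(filter(lambda x: 4 <= len(x) <= 25, a))
>>>['hello world', 'how are you doing', 'where are you going?']
Method 2:
[x for x in a if 4 <= len(x) <= 25]
>>>['hello world', 'how are you doing', 'where are you going?']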
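
The question speaks of tokens, while `len()` on a string counts characters. The following is a minimal sketch of the same filter applied to token counts. It assumes plain whitespace splitting as a stand-in for a real tokenizer, and `keep_by_token_count` is a hypothetical helper name, not something from the original answer.

```python
# Sketch: filter by token count rather than character count.
# Assumption: whitespace splitting approximates tokenization.
a = ["hello world", "how are you doing", "where are you going?",
     "welcome to the greatest show on earth! How will you manage to gain "
     "all the experience needed for this to show?", "hi"]


def keep_by_token_count(sentences, min_tokens=4, max_tokens=25):
    """Return the sentences whose token count lies in [min_tokens, max_tokens]."""
    return [s for s in sentences if min_tokens <= len(s.split()) <= max_tokens]


token_counts = [len(s.split()) for s in a]
kept = keep_by_token_count(a)

print(token_counts)
print(kept)
```

With an NLTK corpus reader, `corpus.sents()` already yields each sentence as a list of tokens, so `len(sent)` would be the token count and no splitting should be needed.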

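len(w) >= 25 and len(w) ...
I think you mean lens = [w for w in corpus.sents() if 4 ...
@ForceBru Oh OK, so how should I do it? Separately? Also, how do I include rare words that occur fewer than 8 times?
@yudhiesh Yes, between 4 and 25, although the second time I got an empty list again.
@jay.andrea 4 > len(w) > 25 means the length of w must be less than 4 and greater than 25 at the same time, which is impossible, so the result is always empty:
out: []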
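
The empty result comes from how Python chains comparisons. Here is a small sketch using the lengths from the example above; the variable names are only illustrative.

```python
# Character lengths of the example strings from the answer.
lengths = [11, 17, 20, 110, 2]

# `4 > n > 25` is evaluated as `(4 > n) and (n > 25)`; no number satisfies both.
impossible = [n for n in lengths if 4 > n > 25]

# Keep the values inside the range.
inside = [n for n in lengths if 4 <= n <= 25]

# Select the values outside the range (the ones to discard) with `or`.
outside = [n for n in lengths if n < 4 or n > 25]

print(impossible)  # []
print(inside)      # [11, 17, 20]
print(outside)     # [110, 2]
```

A chained comparison can only describe one contiguous range, so "too short or too long" needs `or`, or the negation of the inside-range test.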
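
One of the comments also asks about rare words that occur fewer than 8 times, which the thread does not answer. A possible sketch with `collections.Counter` follows. It assumes lowercased whitespace tokens, and `rare_words` and `max_count` are hypothetical names chosen for illustration.

```python
from collections import Counter


def rare_words(sentences, max_count=8):
    """Return the set of lowercased tokens that occur fewer than max_count times."""
    counts = Counter(tok.lower() for s in sentences for tok in s.split())
    return {tok for tok, n in counts.items() if n < max_count}


# Tiny sample with a threshold of 2 so the effect is visible.
sample = ["the cat sat", "the dog sat", "the bird flew"]
rare = rare_words(sample, max_count=2)

print(sorted(rare))  # ['bird', 'cat', 'dog', 'flew']
```

Whether the rare words should then be kept or dropped depends on the preprocessing goal; the returned set can be used for either.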