Python NLTK corpus preprocessing
I am trying to remove the longer (>25 tokens) and shorter (…) sentences from a corpus.

You can do it like this, keeping only the items whose length is less than 26 and greater than 3.
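Comments:
len(w)>=25 and len(w) …
I think you mean lens = [w for w in corpus.sents() if 4 …
@ForceBru Oh okay, so how do I do that? Separately? Also, how do I include rare words that occur fewer than 8 times?
@yudhiesh Yes, 4 … 25, although the second time I got an empty list again.
@jay.andrea 4 > len(w) > 25 means that the length of w is both less than 4 and greater than 25, which is impossible. 4 …
out: []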
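The empty list follows from how Python chains comparisons: `4 > len(x) > 25` is evaluated as `(4 > len(x)) and (len(x) > 25)`, which no length can satisfy. A minimal sketch using sample strings from this thread; the names `impossible`, `kept`, and `dropped` are illustrative and not from the original post:

```python
a = ["hello world", "how are you doing", "where are you going?", "hi"]

# Chained comparison: equivalent to (4 > len(x)) and (len(x) > 25).
# No length is both below 4 and above 25, so nothing ever matches.
impossible = [x for x in a if 4 > len(x) > 25]

# Items to keep: length between 4 and 25, inclusive.
kept = [x for x in a if 4 <= len(x) <= 25]

# Items to drop: too short OR too long (note "or", not "and").
dropped = [x for x in a if len(x) < 4 or len(x) > 25]

print(impossible)
print(kept)
print(dropped)
```

If the goal is to select the items to remove rather than the items to keep, the two conditions have to be joined with `or`.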
a = ["hello world", "how are you doing","where are you going?", "welcome to the greatest show on earth! How will you manage to gain all the experience needed for this to show?","hi"]
[len(w) for w in a]
>>>[11, 17, 20, 110,2]
Method 1:
list(filter(lambda x: 4 <= len(x) <= 25, a))
>>>['hello world', 'how are you doing', 'where are you going?']
Method 2:
[x for x in a if 4 <= len(x) <= 25]
>>>['hello world', 'how are you doing', 'where are you going?']
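Note that `len()` on a string counts characters, while the question talks about tokens. The same filter could be applied to tokenized sentences, where `len()` counts tokens instead. The sketch below assumes the sentences are lists of tokens (the shape NLTK's `corpus.sents()` returns); the sample data and the helper names `filter_by_token_count` and `rare_words` are hypothetical. The second helper is one possible reading of the follow-up about words occurring fewer than 8 times:

```python
from collections import Counter

# Hypothetical stand-in for tokenized sentences (a list of token lists).
sents = [
    ["hello", "world"],
    ["how", "are", "you", "doing"],
    ["where", "are", "you", "going", "?"],
    ["hi"],
    ["word"] * 30,
]


def filter_by_token_count(sentences, lo=4, hi=25):
    """Keep sentences that have between lo and hi tokens, inclusive."""
    return [s for s in sentences if lo <= len(s) <= hi]


def rare_words(sentences, min_count=8):
    """Return the set of tokens that occur fewer than min_count times."""
    counts = Counter(tok for s in sentences for tok in s)
    return {w for w, c in counts.items() if c < min_count}


kept = filter_by_token_count(sents)
rare = rare_words(sents)

print(kept)
print(sorted(rare))
```

Whether the rare words should then be kept, dropped, or replaced by a placeholder token depends on the downstream task; the sketch only identifies them.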