Python: how to quickly get the set of words in a corpus (using nltk)?
I want to quickly build a word lookup table for a corpus with nltk. This is what I am doing:

1. Read in the raw text: file = open('corpus', 'r').read().decode('utf-8')
2. Get all the tokens with a = nltk.word_tokenize(file)
3. Get the unique tokens with set(a), then convert the result back to a list.

Is this the right way to do this task?

Try this:
import time
from collections import Counter

from nltk import FreqDist
from nltk import word_tokenize
from nltk.corpus import brown

def time_uniq(maxchar):
    # Let's just take the first `maxchar` characters.
    words = brown.raw()[:maxchar]

    # Time to tokenize.
    start = time.time()
    words = word_tokenize(words)
    print(time.time() - start)

    # Using collections.Counter.
    start = time.time()
    x = Counter(words)
    uniq_words = x.keys()
    print(time.time() - start)

    # Using nltk.FreqDist.
    start = time.time()
    fd = FreqDist(words)
    uniq_words = fd.keys()
    print(time.time() - start)

    # If you don't need frequency info, use set().
    start = time.time()
    uniq_words = set(words)
    print(time.time() - start)
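A minimal driver for the benchmark; the exact maxchar values are an assumption on my part, inferred from the three groups of four timings below, which grow by roughly 10x per run:

if __name__ == '__main__':
    # Assumed input sizes: each call prints four timings
    # (tokenize, Counter, FreqDist, set), matching the three
    # groups of four numbers in the output below.
    for maxchar in (10000, 100000, 1000000):
        time_uniq(maxchar)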
[out]:
~$ python test.py
0.0413908958435
0.000495910644531
0.000432968139648
9.3936920166e-05
0.10734796524
0.00458407402039
0.00439405441284
0.00084400177002
1.12890005112
0.0492491722107
0.0490930080414
0.0100378990173
To load your own corpus file instead, assuming the file is small enough to fit into RAM:
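A minimal sketch of that in-memory approach (myfile.txt is a placeholder filename, an assumption for illustration):

from collections import Counter
from nltk import word_tokenize

# Read the whole file into memory at once.
with open('myfile.txt', 'r') as fin:
    text = fin.read()

words = word_tokenize(text)

# With frequency info:
x = Counter(words)
uniq = x.keys()

# Without frequency info:
uniq = set(words)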
If the file is too large to fit into RAM, you may want to process it one line at a time:
from collections import Counter
from nltk import word_tokenize

# Using Counter.
x = Counter()
with open('myfile.txt', 'r') as fin:
    # Iterate over the file object directly; a file object
    # has no .split() method.
    for line in fin:
        x.update(word_tokenize(line))
uniq = x.keys()

# Using set.
x = set()
with open('myfile.txt', 'r') as fin:
    for line in fin:
        x.update(word_tokenize(line))
uniq = x  # A set already holds only unique tokens; it has no .keys().
If you hit errors, give it a try and get back to us. A style note: prefer the name text over file, to signal that the variable holds text rather than an open file object. If your text is in English, word_tokenize is a good choice because, for example, it knows how to split standard contractions, which a naive tokenizer based on Python's str.split cannot do (see the sketch below).
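A quick illustration of that difference on a contraction:

from nltk import word_tokenize

text = "Don't do that!"
print(word_tokenize(text))  # ['Do', "n't", 'do', 'that', '!']
print(text.split())         # ["Don't", 'do', 'that!']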