How do I quickly get the set of words in a corpus with Python (using nltk)?


I want to quickly build a word lookup table for a corpus with nltk. Here is what I am doing (a sketch of these steps follows the list):

  • Read the raw text: file = open("corpus", "r").read().decode('utf-8')
  • Get all tokens with a = nltk.word_tokenize(file)
  • Get the unique tokens with set(a) and convert them back to a list
  • Is this the right way to do this task?
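
A minimal runnable sketch of those three steps, assuming the corpus is a UTF-8 text file named "corpus" (io.open with an explicit encoding is used here so the same code works on Python 2 and 3; the variable names mirror the question):

    import io
    from nltk import word_tokenize

    # Read the raw text, decoded to unicode via the explicit encoding.
    with io.open("corpus", "r", encoding="utf-8") as fin:
        file = fin.read()

    # All tokens, then the unique tokens converted back to a list.
    a = word_tokenize(file)
    unique_words = list(set(a))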

Try this:

    import time
    from collections import Counter
    
    from nltk import FreqDist
    from nltk.corpus import brown
    from nltk import word_tokenize
    
    def time_uniq(maxchar):
        # Let's just take the first 10000 characters.
        words = brown.raw()[:maxchar] 
    
        # Time to tokenize
        start = time.time()
        words = word_tokenize(words)
        print(time.time() - start)
    
        # Using collections.Counter
        start = time.time()
        x = Counter(words)
        uniq_words = x.keys()
        print(time.time() - start)
    
        # Using nltk.FreqDist
        start = time.time()
        fdist = FreqDist(words)
        uniq_words = fdist.keys()
        print(time.time() - start)
    
        # If you don't need frequency info, use set()
        start = time.time()
        uniq_words = set(words)
        print(time.time() - start)
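
The three timing blocks in the output below presumably come from calling time_uniq with increasing maxchar values; the exact numbers here are only a guess based on the roughly 10x growth of the timings:

    # Hypothetical driver; maxchar values are inferred from the output, not given above.
    for maxchar in (10000, 100000, 1000000):
        time_uniq(maxchar)
        print('')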
    
    [out]:

    ~$ python test.py 
    0.0413908958435
    0.000495910644531
    0.000432968139648
    9.3936920166e-05
    
    0.10734796524
    0.00458407402039
    0.00439405441284
    0.00084400177002
    
    1.12890005112
    0.0492491722107
    0.0490930080414
    0.0100378990173
    
To load your own corpus file (assuming your file is small enough to fit into RAM):
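
A minimal sketch, assuming a UTF-8 text file named myfile.txt:

    import io
    from collections import Counter
    from nltk import word_tokenize

    # Read the whole file into memory and tokenize once.
    with io.open('myfile.txt', 'r', encoding='utf8') as fin:
        words = word_tokenize(fin.read())

    x = Counter(words)          # unique words with frequency counts
    uniq_words = x.keys()
    # Or, if you don't need frequency info:
    uniq_words = set(words)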

If the file is too big, you may want to process it one line at a time:

    from collections import Counter
    from nltk import FreqDist, word_tokenize

    from nltk.corpus import brown

    # Using Counter (also keeps frequency counts).
    x = Counter()
    with open('myfile.txt', 'r') as fin:
        for line in fin:  # iterate over the file line by line
            x.update(word_tokenize(line))
    uniq = x.keys()

    # Using set (no frequency info).
    x = set()
    with open('myfile.txt', 'r') as fin:
        for line in fin:
            x.update(word_tokenize(line))
    uniq = x  # a set has no .keys(); it already holds the unique tokens
    

Try it out, and come back to us if you hit an error. Style note: it is better to name the variable "text" rather than "file", to show that it holds text and not an open file object. If your text is in English, using word_tokenize is fine, since it knows, for example, how to split standard contractions, which a naive tokenizer based on Python's str.split cannot do (see the example below).
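
For example, a quick comparison on a sentence with a contraction (the first output reflects NLTK's default Treebank-style tokenization):

    from nltk import word_tokenize

    s = "I don't think so."
    print(word_tokenize(s))   # ['I', 'do', "n't", 'think', 'so', '.']
    print(s.split())          # ['I', "don't", 'think', 'so.']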