Efficiently count word frequencies in Python
I want to count the frequencies of all words in a text file.
>>> countInFile('test.txt')
given a target text file like:
# test.txt
aaa bbb ccc
bbb
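I have implemented it in pure Python. However, I found that the pure-Python approach is not sufficient because the file is huge (> 1 GB).
I think borrowing sklearn's power is a candidate.
If you let CountVectorizer count the frequencies for each line, I guess you get the word frequencies by summing up each column. But that sounds like a somewhat indirect way.
What is the most efficient and straightforward way to count the words in a file with Python?
Update
My (very slow) code is here: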
import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            # Python 2 str.translate signature; on Python 3 use
            # line.translate(str.maketrans('', '', string.punctuation))
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}
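Skip CountVectorizer and scikit-learn. The file may be too large to load into memory, but I doubt the Python dictionary gets too large. The simplest option may be to split the large file into 10-20 smaller files and extend your code to loop over the smaller files. That should be enough.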
def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d
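The suggestion to split the input and loop over the pieces could look roughly like the sketch below. The function name count_words_in_files and the list of paths are assumptions made for illustration; the point is only that a single Counter can be updated across files, so the per-file results never need to be merged afterwards.

```python
from collections import Counter

def count_words_in_files(paths):
    """Accumulate word counts across several files into one Counter.

    `paths` is a hypothetical list of chunk files produced by splitting
    the large input; the name is illustrative, not from the thread.
    """
    total = Counter()
    for path in paths:
        with open(path, encoding="utf8") as fin:
            for line in fin:
                # update() adds to the existing counts instead of replacing them
                total.update(line.split())
    return total
```

Only one line and the running totals are held in memory at any time, regardless of how many chunk files there are.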
The most succinct approach is to use the tools Python gives you.
from future_builtins import map  # Only on Python 2
from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))
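That's it. map(str.split, f) makes a generator that returns a list of words for each line. Wrapping it in chain.from_iterable converts that to a single generator that produces one word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts; during creation, only one line of data and the running totals are stored at a time, not the whole file.
In theory, on Python 2.7 and 3.1, you might do slightly better by counting yourself with a dict or collections.defaultdict(int) (because Counter is implemented in Python there, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that runs faster than anything you could write in pure Python.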
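To see the lazy pipeline in isolation, here is a small Python 3 sketch that feeds it an in-memory stand-in for a file. The helper name count_words and the sample text are assumptions for illustration; on Python 3 the built-in map is already lazy, so no future_builtins import is needed.

```python
import io
from collections import Counter
from itertools import chain

def count_words(lines):
    """Count whitespace-separated words from any iterable of text lines."""
    # map yields one list of words per line; chain.from_iterable flattens
    # those lists into a single stream of words for Counter to consume.
    return Counter(chain.from_iterable(map(str.split, lines)))

# A file object iterates line by line; StringIO stands in for one here.
sample = io.StringIO("aaa bbb ccc\nbbb\n")
counts = count_words(sample)
print(counts.most_common(1))
```

Because count_words accepts any iterable of lines, the same function works on an open file, a list of strings, or a generator.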
Update: It seems you want punctuation stripped and case-insensitivity, so here's a variant of my previous code that does that:
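from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        # Python 2 str.translate signature: delete every punctuation character
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))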
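The call line.translate(None, punctuation) only works on Python 2 byte strings. A possible Python 3 equivalent, sketched under the assumption that the file is UTF-8 text, builds a deletion table with str.maketrans. The names STRIP_PUNCT and count_in_file_py3 are made up for this example.

```python
from collections import Counter
from itertools import chain
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", punctuation)

def count_in_file_py3(filename):
    """Python 3 sketch: strip punctuation, lowercase, then count words."""
    with open(filename, encoding="utf8") as f:
        linewords = (line.translate(STRIP_PUNCT).lower().split() for line in f)
        # Counter consumes the generator here, while the file is still open.
        return Counter(chain.from_iterable(linewords))
```

Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes are left in place.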
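Your code runs much more slowly because it creates and destroys many small Counter and set objects, rather than calling .update on a single Counter once per line (which, while slightly slower than the updated code block I gave, would at least be algorithmically similar in scaling factor).
An efficient and accurate approach is to make use of:
- CountVectorizer in scikit-learn (for n-gram extraction)
- NLTK for word_tokenize
- numpy matrix sum to collect the counts
- collections.Counter to collect the counts and vocabulary
An example: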
import urllib.request
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where each row represents a sentence and each column holds the count of one token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))
# Vocabulary
# (scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out())
vocab = list(ngram_vectorizer.get_feature_names())
# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
[out]:
[(',', 32000),
('.', 17783),
('de', 11225),
('a', 7197),
('que', 5710),
('la', 4732),
('je', 4304),
('se', 4013),
('на', 3978),
('na', 3834)]
Essentially, you can also do this:
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))
Let's timeit:
import time
start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)
[out]:
5.257147789001465
Note that CountVectorizer can also take a file instead of a string, so there is no need to read the whole file into memory. In code:
import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
infile = '/path/to/input.txt'
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
vocab = ngram_vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
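I work with the binary data instead of decoding all the bytes read from the url. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.
The function freq_dist expects an iterable. That's why I pass data.splitlines().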
from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
data = urlopen(url).read()

def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    #For readability
    #return Counter(word for line in data
    #    for word in line.translate(
    #    None,bytes(punctuation.encode('utf-8'))).decode('utf-8').split())
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))
Output
elapsed: 0.806480884552
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
It seems dict is more efficient than a Counter object.
def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])
Output
elapsed: 0.642680168152
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
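To be more memory efficient when opening a huge file, you have to pass just the opened url. But then the timing will include the file download time too.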
data = urlopen(url)
word_dist = freq_dist(data)
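The code in this answer targets Python 2 (urllib2, u'' output). A rough Python 3 sketch of the same bytes-based idea follows. It assumes any iterable of byte strings, such as a file opened in 'rb' mode or an HTTP response object; io.BytesIO stands in for that here, and the name freq_dist_bytes is hypothetical.

```python
import io
from collections import Counter
from string import punctuation

# ASCII punctuation as bytes, usable as the `delete` argument of bytes.translate.
PUNC = punctuation.encode("ascii")

def freq_dist_bytes(lines):
    """Count words from an iterable of byte strings, one line at a time."""
    words = (
        word
        for line in lines
        # bytes.translate(None, PUNC) drops the punctuation bytes; this is
        # safe on UTF-8 because multi-byte characters never contain ASCII bytes.
        for word in line.translate(None, PUNC).decode("utf-8").split()
    )
    return Counter(words)

# BytesIO stands in for a binary file or an opened URL.
sample = io.BytesIO("de la, de. на на\n".encode("utf-8"))
dist = freq_dist_bytes(sample)
print(dist.most_common(2))
```

Since the input is consumed lazily, only one line is decoded at a time, which is the memory benefit the answer describes.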
Here are some benchmarks. It looks strange, but the crudest code wins.
[code]:
from collections import Counter, defaultdict
import io, time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        # scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out()
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            # Loop variable renamed from `words` so it matches the key used below
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_native(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_paddle(infile)
print(time.time() - start)
[out]:
38.306814909
24.8241138458
12.1182529926
Data size (154MB) used in the benchmark above:
$ wc -c /path/to/file
161680851
$ wc -l /path/to/file
2176141
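Some things to note:
- With the sklearn version, there is an overhead of vectorizer creation + numpy manipulation and conversion into a Counter object
- Then with the native Counter update version, it seems that Counter.update() is an expensive operation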
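To check the observations above on your own machine, a small timeit harness such as the following sketch can compare the pure-Python strategies. The synthetic LINES data and the function names are assumptions for illustration; results depend heavily on the Python version and the data, so no timings are claimed here.

```python
import timeit
from collections import Counter, defaultdict
from itertools import chain

# Synthetic stand-in for the lines of a file (an assumption for this sketch).
LINES = ["aaa bbb ccc", "bbb ddd aaa", "ccc ccc eee"] * 2000

def count_update_per_line(lines):
    # One Counter, updated once per line.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def count_defaultdict(lines):
    # Plain nested loops with a defaultdict(int).
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def count_single_counter(lines):
    # One Counter fed by a single flattened stream of words.
    return Counter(chain.from_iterable(map(str.split, lines)))

for fn in (count_update_per_line, count_defaultdict, count_single_counter):
    seconds = timeit.timeit(lambda: fn(LINES), number=5)
    print("{}: {:.3f}s".format(fn.__name__, seconds))
```

All three functions return the same counts, so any difference in the printed times comes from the counting strategy alone.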
You can try with sklearn:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data = ['i am student', 'the student suffers a lot']
transformed_data = vectorizer.fit_transform(data)
# scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out()
vocab = {a: b for a, b in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
print(vocab)
Combining everyone else's views and some of my own :)
Here is what I have for you:
from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords

# Raw string so that \w is kept literally
text = r'''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize
like splitting apart contractions. You can naively
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
print(most_common)
Output (all)
[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]
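The sample text in this answer mentions that you can naively split on the regex \w+ without NLTK. A minimal sketch of that alternative, using only the standard library, might look like this; the names WORD_RE and count_regex_words and the sample sentence are made up for the example.

```python
import re
from collections import Counter

# \w+ matches runs of letters, digits and underscores (Unicode-aware on Python 3).
WORD_RE = re.compile(r"\w+")

def count_regex_words(text):
    """Lowercase the text and count every run of word characters."""
    return Counter(WORD_RE.findall(text.lower()))

counts = count_regex_words("You can't split contractions; you can split words.")
print(counts.most_common(3))
```

The trade-off is visible in the example: "can't" is broken into "can" and "t", which is the kind of contraction handling that word_tokenize does more carefully.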
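It could be done better in terms of efficiency, but if you aren't too worried about that, this code is the best.
Splitting words in Python has to allocate memory for the list and create many str objects, plus there is the dictionary creation, and Python's hash isn't very fast. For maximum performance you can write a C extension that looks for word boundaries without copying memory, counts them with the fastest hash available and, when done, creates a Python dict.
Are you matching certain words, or trying to count every unique "word"? How many unique words do you expect to find in a 1 GB file? Also, how long are the lines on average?
Switching to C or to some module probably won't improve the execution time that much (a basic Python test on a 950M dataset takes 25 seconds, which isn't slow). The problem is that it stores all words in memory (so you need at least 1G of free memory). If your data is limited to 1G, this is pr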