Efficiently counting words across multiple lists in Python

python performance python-3.x dataframe


I have a list of comments in the following format:

Comments=[['hello', 'world'], ['would', 'hard', 'press'],['find', 'place', 'less']]

wordset={'hello','world','hard','would','press','find','place','less'}

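I would like a table or DataFrame that has the words in wordset as its index and, for each comment in Comments, a column holding that comment's count of each word.

I produced the desired DataFrame with the code below, but it takes a very long time and I need a more efficient implementation. Because the corpus is large, this has a big impact on the efficiency of our ranking algorithm.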

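import pandas as pd

result = pd.DataFrame()
for comment in Comments:
    # start each comment with a zero count for every word in wordset
    worddict_terms = dict.fromkeys(wordset, 0)
    for items in comment:
        worddict_terms[items] += 1
        df_comment = pd.DataFrame.from_dict([worddict_terms])
    # append this comment's row to the rows collected so far
    frames = [result, df_comment]
    result = pd.concat(frames)

# words become the index, comments become the columns
Comments_raw_terms = result.transpose()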

The result we expect is:

        0   1   2
hello   1   0   0
world   1   0   0
would   0   1   0
press   0   1   0
find    0   0   1
place   0   0   1
less    0   0   1
hard    0   1   0

Try the following approach:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

text = pd.Series(Comments).str.join(' ')
X = vect.fit_transform(text)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())

Result:

In [49]: r
Out[49]:
   find  hard  hello  less  place  press  world  would
0     0     0      1     0      0      0      1      0
1     0     1      0     0      0      1      0      1
2     1     0      0     1      1      0      0      0

In [50]: r.T
Out[50]:
       0  1  2
find   0  0  1
hard   0  1  0
hello  1  0  0
less   0  0  1
place  0  0  1
press  0  1  0
world  1  0  0
would  0  1  0

A pure pandas solution:

In [61]: pd.get_dummies(text.str.split(expand=True), prefix_sep='', prefix='')
Out[61]:
   find  hello  would  hard  place  world  less  press
0     0      1      0     0      0      1     0      0
1     0      0      1     1      0      0     0      1
2     1      0      0     0      1      0     1      0

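I think your nested for loops are adding to the complexity. The code below replaces the two loops with a single map call. I have only written part of it: for each comment in Comments, it produces a dictionary with the counts of 'hello' and 'world'. You will need to add the remaining pandas code that builds the table yourself.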

from collections import Counter
from funcy import project

def fun(comment):
    wordset = {'hello', 'world'}
    # count every token in this comment
    temp_dict_comment = dict(Counter(comment))
    # keep only the keys that appear in wordset
    final_dict = project(temp_dict_comment, wordset)
    print(final_dict)

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less', 'excitingit', 'wors', 'watch', 'paint', 'dri']]

# map is lazy in Python 3, so wrap it in list() to actually run fun
list(map(fun, Comments))

This should help, because it uses a single map call instead of two for loops.
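The answer above leaves the table-building step to the reader. Here is a minimal sketch of how it might be finished, using only the standard library. The helper name `count_matrix` is made up for illustration and is not part of the original answer. It builds one `Counter` per comment and then reads off the counts for every word in `wordset`.

```python
from collections import Counter

def count_matrix(comments, wordset):
    """Return {word: [count in comment 0, count in comment 1, ...]}."""
    # One Counter per comment: a single pass over each token list.
    per_comment = [Counter(comment) for comment in comments]
    # A Counter returns 0 for missing keys, so no pre-filled dict is needed.
    return {word: [c[word] for c in per_comment] for word in sorted(wordset)}

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less']]
wordset = {'hello', 'world', 'hard', 'would', 'press', 'find', 'place', 'less'}

table = count_matrix(Comments, wordset)
for word, counts in table.items():
    print(word, counts)
```

If pandas is available, passing the result to `pd.DataFrame.from_dict(table, orient='index')` should give the word-by-comment layout from the question in a single construction step, which avoids the repeated `pd.concat` calls.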
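Can you post the dataset you want?
I've just added the expected result, please take a look.
A note on efficiency: for whatever reason, a plain defaultdict(int) is faster than Counter.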
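The relative speed of `defaultdict(int)` and `Counter` may well depend on the Python version and the data, so it is probably worth measuring on your own corpus. Below is a rough sketch of such a comparison. The token list is made-up sample data and `count_with_defaultdict` is a hypothetical helper, not code from the thread.

```python
from collections import Counter, defaultdict
import timeit

def count_with_defaultdict(tokens):
    # Missing keys start at int() == 0, so += 1 works on first sight of a token.
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1
    return counts

# Made-up sample data: a few repeated tokens, scaled up to get measurable timings.
tokens = ['find', 'place', 'less', 'find', 'less', 'find'] * 1000

# Both approaches must agree before their timings are worth comparing.
assert dict(count_with_defaultdict(tokens)) == dict(Counter(tokens))

t_default = timeit.timeit(lambda: count_with_defaultdict(tokens), number=200)
t_counter = timeit.timeit(lambda: Counter(tokens), number=200)
print(f"defaultdict(int): {t_default:.4f}s  Counter: {t_counter:.4f}s")
```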