Python: getting word frequencies for a column that contains lists of strings


python pandas dataframe


I have a DataFrame:

import pandas as pd
test = pd.DataFrame({'words':[['foo','bar none','scare','bar','foo'],
                              ['race','bar none','scare'],
                              ['ten','scare','crow bird']]})

I am trying to get word/phrase counts across all list elements in the DataFrame column. My current solution is:

allwords = []

for index, row in test.iterrows():
    for word in row['words']:
        allwords.append(word)

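This works, but I would like to know whether there is a faster solution. Note: I am not using ' '.join() because I do not want to split the phrases into individual words.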
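To illustrate the concern, here is a small sketch of what joining on spaces and splitting again would do to the multi-word phrases. The variable names (words, joined, split_counts) are made up for this example, and the plain list stands in for the DataFrame column.

```python
from collections import Counter

# Plain-list stand-in for the 'words' column of the DataFrame above
words = [['foo', 'bar none', 'scare', 'bar', 'foo'],
         ['race', 'bar none', 'scare'],
         ['ten', 'scare', 'crow bird']]

# Joining every element with spaces and splitting again loses phrase boundaries
joined = ' '.join(' '.join(row) for row in words)
split_counts = Counter(joined.split())

# 'bar none' is no longer a key; its parts are folded into 'bar' and 'none'
print(split_counts['bar'])          # 3: one standalone 'bar' plus two from 'bar none'
print(split_counts['none'])         # 2
print('bar none' in split_counts)   # False
```

This is why the phrases need to be counted as whole list elements rather than as whitespace-separated tokens.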

For better performance, avoid iterrows:

from collections import Counter
from itertools import chain

a = pd.Series(Counter(chain.from_iterable(test['words']))).sort_values(ascending=False)
print(a)
scare        3
foo          2
bar none     2
bar          1
race         1
ten          1
crow bird    1
dtype: int64
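The one-liner above combines two standard-library pieces, and it may help to see them separately. The following is a minimal sketch that uses a plain list of lists (named rows here, an assumption standing in for test['words'].tolist()) so that it runs without pandas.

```python
from collections import Counter
from itertools import chain

# Stand-in for test['words'].tolist(); the name 'rows' is illustrative only
rows = [['foo', 'bar none', 'scare', 'bar', 'foo'],
        ['race', 'bar none', 'scare'],
        ['ten', 'scare', 'crow bird']]

# chain.from_iterable lazily yields each element of each inner list in turn,
# so no intermediate flattened list is built
flat = chain.from_iterable(rows)

# Counter consumes the iterator and tallies the hashable items (strings here)
counts = Counter(flat)

# most_common(n) returns the n highest counts as (item, count) tuples
top = counts.most_common(1)
print(top)  # [('scare', 3)]
```

Wrapping the Counter in pd.Series, as the answer does, is only needed if the result should be a pandas object; for a plain tally, the Counter itself may be enough.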

A pandas-only solution:

a = pd.Series([y for x in test['words'] for y in x]).value_counts()
print(a)
scare        3
bar none     2
foo          2
bar          1
race         1
crow bird    1
ten          1
dtype: int64
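Another pandas-only variant, not part of the answers in this thread, is Series.explode, which is available in pandas 0.25 and later. The sketch below assumes such a version is installed. It has not been benchmarked here, so whether it is faster than the list comprehension is an open question.

```python
import pandas as pd

test = pd.DataFrame({'words': [['foo', 'bar none', 'scare', 'bar', 'foo'],
                               ['race', 'bar none', 'scare'],
                               ['ten', 'scare', 'crow bird']]})

# explode() gives each list element its own row (the original index values repeat)
exploded = test['words'].explode()

# value_counts() then tallies the resulting strings, highest count first
a = exploded.value_counts()
print(a)
```

Phrases such as 'bar none' stay intact, because explode works on list elements and never splits strings.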

Let's try np.hstack (the top-level pd.value_counts is deprecated in recent pandas releases, so the Series method is used here):

import numpy as np
pd.Series(np.hstack(test['words'])).value_counts()

scare        3
foo          2
bar none     2
ten          1
bar          1
crow bird    1
race         1
dtype: int64

Try using Counter:

import collections
words = test['words'].tolist()

collections.Counter([x for sublist in words for x in sublist])

Counter({'foo': 2,
         'bar none': 2,
         'scare': 3,
         'bar': 1,
         'race': 1,
         'ten': 1,
         'crow bird': 1})
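Since the question asks which approach is faster, one way to find out on your own data is to check that the approaches agree and then time them with timeit. The following is a rough sketch; the function names are hypothetical, and no timings are quoted because they depend on the machine, library versions, and data size.

```python
from collections import Counter
from itertools import chain
import timeit

import pandas as pd

test = pd.DataFrame({'words': [['foo', 'bar none', 'scare', 'bar', 'foo'],
                               ['race', 'bar none', 'scare'],
                               ['ten', 'scare', 'crow bird']]})


def with_iterrows():
    # The loop from the question, followed by a Counter over the collected words
    allwords = []
    for _, row in test.iterrows():
        allwords.extend(row['words'])
    return Counter(allwords)


def with_chain():
    # Flatten lazily with chain.from_iterable, then count
    return Counter(chain.from_iterable(test['words']))


def with_comprehension():
    # Flatten with a nested list comprehension, then count in pandas
    return pd.Series([y for x in test['words'] for y in x]).value_counts()


# All approaches should agree on the counts before any timing is compared
expected = with_chain()
assert with_iterrows() == expected
assert with_comprehension().to_dict() == dict(expected)

# Time each approach; a three-row frame is far too small for meaningful results
for fn in (with_iterrows, with_chain, with_comprehension):
    elapsed = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {elapsed:.4f} s for 200 calls")
```

To get representative numbers, replace the three-row example with a frame of realistic size before drawing conclusions.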