Python: sum of word frequencies in a DataFrame column, using words from a list

python pandas

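I have a column of data containing text, and a list of individual words that I want to match against the text column. For each row, I want the total number of times those words appear in the text.

Here is an example: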

import pandas as pd

wordlist = ['alaska', 'france', 'italy']

test = pd.read_csv('vacation text.csv')
test.head(4)

Index    Text
0        'he's going to alaska and france'
1        'want to go to italy next summer'
2        'germany is great!'
4        'her parents are from france and alaska but she lives in alaska'

I tried the following code:

test['count'] = pd.Series(test.text.str.count(r).sum() for r in wordlist)

and this code:

test['count'] = pd.Series(test.text.str.contains(r).sum() for r in wordlist)

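The problem is that the sums do not seem to accurately reflect the word counts in the Text column. I noticed this when, using my example again, I added germany to my list and the sum for that row did not change from 0 to 1.

Ultimately, I would like my data to look like this: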

Index    Text                                                               Count
0        'he's going to alaska and france'                                  2
1        'want to go to italy next summer'                                  1
2        'germany is great!'                                                0
4        'her parents are from france and alaska but she lives in alaska'   3

Does anyone know another way to do this?

One way is to use str.count:

In [792]: test['Text'].str.count('|'.join(wordlist))
Out[792]:
0    2
1    1
2    0
3    3
Name: Text, dtype: int64
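
One caveat worth checking for your data: str.count treats its argument as a regular expression and counts substring matches, so alaska would also be counted inside alaskan, and the match is case-sensitive. The sketch below is one possible way to count whole words only. The two extra rows and the variable names are assumptions added for illustration, not part of the original data.

```python
import re

import pandas as pd

# Sample rows modelled on the question; the last two rows are hypothetical
# and exist only to show substring and case behaviour.
test = pd.DataFrame({"Text": [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
    "alaskan salmon is popular",
    "France and ITALY",
]})
wordlist = ["alaska", "france", "italy"]

# Plain alternation: counts substrings and is case-sensitive.
plain = test["Text"].str.count("|".join(wordlist))

# Escape each word (in case it contains regex metacharacters) and wrap the
# alternation in word boundaries so only whole words are counted.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in wordlist) + r")\b"
whole_words = test["Text"].str.count(pattern, flags=re.IGNORECASE)

print(pd.DataFrame({"plain": plain, "whole_words": whole_words}))
```

Whether you want whole-word or case-insensitive matching depends on your text; if substring matches are acceptable, the simpler joined pattern is enough.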

Another way is to sum the counts of the individual words:

In [802]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist}).sum(1)
Out[802]:
0    2
1    1
2    0
3    3
dtype: int64

Details

In [804]: '|'.join(wordlist)
Out[804]: 'alaska|france|italy'

In [805]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist})
Out[805]:
   alaska  france  italy
0       1       1      0
1       0       0      1
2       0       0      0
3       2       1      0
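
As for why the attempt in the question gives odd numbers, my reading is this: str.count(r).sum() collapses each word to a single total over all rows, so the resulting Series has one entry per word rather than one per row. Assigning it to a column then aligns it on the index, which puts the per-word totals into rows 0, 1, 2 and leaves the rest as NaN. A small sketch, assuming the column is named Text and rebuilding the sample data inline instead of reading the CSV:

```python
import pandas as pd

test = pd.DataFrame({"Text": [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
]})
wordlist = ["alaska", "france", "italy"]

# What the original attempt computes: one total per WORD (summed over rows).
per_word_totals = pd.Series(
    [test["Text"].str.count(w).sum() for w in wordlist]
)

# Assigning it as a column aligns on the index: rows 0..2 receive the totals
# for alaska, france, italy, and row 3 has no matching entry, so it is NaN.
misaligned = test.copy()
misaligned["count"] = per_word_totals

# What was wanted: one count per ROW.
test["Count"] = test["Text"].str.count("|".join(wordlist))

print(misaligned)
print(test)
```

This would also explain why adding germany to the list did not change row 2: the extra total ends up in the fourth position of the Series, not in the row that contains the word.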

Is there a significant time difference between the first and second methods?

Naturally, the latter will be slower; you should benchmark for your use case.
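
If you want to measure it yourself, something along these lines could serve as a starting point. It is only a rough sketch: the row count, the repeat count, and the function names are arbitrary choices for illustration, and the timings will vary with your data and pandas version.

```python
import timeit

import pandas as pd

wordlist = ["alaska", "france", "italy"]
rows = [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
]
# Repeat the sample rows to get a frame large enough to time (8,000 rows).
big = pd.DataFrame({"Text": rows * 2000})


def joined_pattern():
    # First method: a single pass with one alternation pattern.
    return big["Text"].str.count("|".join(wordlist))


def per_word_sum():
    # Second method: one pass per word, then a row-wise sum.
    counts = pd.DataFrame({w: big["Text"].str.count(w) for w in wordlist})
    return counts.sum(axis=1)


# Sanity check that both methods agree before comparing their speed.
same = bool((joined_pattern().to_numpy() == per_word_sum().to_numpy()).all())

t_joined = timeit.timeit(joined_pattern, number=5)
t_per_word = timeit.timeit(per_word_sum, number=5)
print(f"same result: {same}")
print(f"joined pattern: {t_joined:.4f}s  per-word sum: {t_per_word:.4f}s")
```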