Pythonic way to count how many times words from a list/set appear in a DataFrame column


python pandas dataframe


Given a list/set of labels

labels = {'rectangle', 'square', 'triangle', 'cube'}

and a DataFrame df

df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

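I would like to know how many times each word in my set of labels occurs in the text column of the DataFrame, and to create a new column containing the X (maybe 2 or 3) most frequently repeated words. If two words are repeated the same number of times, they can both appear in a list or string.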

Output:

pd.DataFrame({'text': ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels': [{'rectangle': 2, 'square': 1, 'cube': 1}, {'triangle': 1}, np.nan]})

df['best_labels'] = some_function(df.text)

Prints:

                                    text                               best_labels
0  rectangle rectangle in my square cube  {'rectangle': 2, 'square': 1, 'cube': 1}
1               triangle circle not here                           {'triangle': 1}
2                           nothing here                                       NaN

Another way to visualize the data is with a matrix:

(df['text'].str.extractall(r'\b({})\b'.format('|'.join(labels)))
           .groupby(level=0)[0]
           .value_counts()
           .unstack()
           .reindex(df.index)
           .rename_axis(None, axis=1))

   cube  rectangle  square  triangle
0   1.0        2.0     1.0       NaN
1   NaN        NaN     NaN       1.0
2   NaN        NaN     NaN       NaN

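The idea is to extract the words listed in labels from each row, then count how many times each one occurs per sentence. NaN marks a label that does not occur in that row, which is also why the counts are shown as floats.
What does this look like? Yes, you guessed it: a sparse matrix.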
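
The question asks for a best_labels column of dicts rather than a matrix. Below is a minimal sketch of one way to get there. It swaps extractall for a plain re.findall plus collections.Counter per row, so it is not the method shown above. The helper name best_labels and its n parameter are assumptions made for illustration.

```python
import re
from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(['rectangle rectangle in my square cube',
                   'triangle circle not here',
                   'nothing here'], columns=['text'])

# Whole-word pattern built from the labels; sorted() only makes the
# pattern deterministic, re.escape guards against regex metacharacters.
pattern = re.compile(
    r'\b(?:{})\b'.format('|'.join(map(re.escape, sorted(labels)))))

def best_labels(text, n=3):
    """Hypothetical helper: the n most frequent labels in text as a dict."""
    counts = Counter(pattern.findall(text))
    if not counts:
        return np.nan  # no label found, as in the expected output
    return dict(counts.most_common(n))

df['best_labels'] = df['text'].apply(best_labels)
```

Note that Counter.most_common(n) simply stops after n entries, so labels tied at the cutoff are not all kept. If every tied label must appear, the cutoff would need separate handling.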
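Is pd.DataFrame({'text': ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels': [{'rectangle': 2, 'square': 1, 'cube': 1}, {'triangle': 1}, np.nan]}) something you have, or is that part of the expected output?

Why not keep an empty set in best_labels for the rows with no match? np.nan ("not a number") is an odd "default" value to use here, since none of the valid values are numbers.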
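
The suggestion in the last comment could look roughly like the sketch below. It uses an empty dict rather than an empty set, since the other cells are dicts, and it counts with collections.Counter per row, which is an assumption and not code from the thread. The column name n_labels is hypothetical and only there to show why a uniform cell type is convenient.

```python
import re
from collections import Counter

import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(['rectangle rectangle in my square cube',
                   'triangle circle not here',
                   'nothing here'], columns=['text'])

pattern = re.compile(
    r'\b(?:{})\b'.format('|'.join(map(re.escape, sorted(labels)))))

# dict() of an empty most_common() list is {}, so rows without any
# label get an empty dict instead of NaN.
df['best_labels'] = [dict(Counter(pattern.findall(text)).most_common(3))
                     for text in df['text']]

# Every cell is a dict, so the same operation applies to every row
# without special-casing missing values.
df['n_labels'] = df['best_labels'].map(len)
```

With NaN in the column, the same len call would fail on the unmatched row, because NaN is a float.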