Python: how to group and get the most frequent words and bigrams per group

I'm currently working with a dataframe that looks like this:

 words:                               other:   category:    
 hello, jim, you, you , jim            val1      movie
 it, seems, bye, limb, pat, paddy      val2      movie
 how, are, you, are , kim              val1      television
 ...
I'm trying to compute the 10 most frequent words and bigrams for each category in the "category" column. However, I want to compute the bigrams first and only then group them into their corresponding categories.

My problem is that if I group by category first and then extract the 10 most frequent words, the words from the first row get merged with those of the second row.

The bigrams should look like this:

 (hello, jim), (jim, you), (you, you), (you, jim)
 (it, seems), (seems, bye), (bye, limb), (limb, pat), (pat, paddy)
 (how, are), (are, you), (you, are), (are, kim)
However, if I group before generating the bigrams, the bigrams come out as:

 (hello, jim), (jim, you), (you, you), (you, jim), (jim, it), (it, seems), (seems, bye), (bye, limb), (limb, pat), (pat, paddy)
 (how, are), (are, you), (you, are), (are, kim)
What's the best way to do this with Pandas?

Sorry if my question is overly detailed; I just wanted to include everything. Please let me know if anything is unclear.

Example dataframe:

                                   words other    category
0             hello, jim, you, you , jim  val1       movie
1  it, seems, bye, limb, pat, hello, jim  val2       movie
2               how, are, you, are , kim  val1  television
Here is one way to compute the bigrams using Pandas and iterrows().
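A minimal sketch of what that can look like (assuming df is the example dataframe above):

# Build one bigram list per row with iterrows(), so bigrams never cross rows
bigrams = []
for _, row in df.iterrows():
    lst = row['words'].split(', ')
    bigrams.append([(lst[x].strip(), lst[x+1].strip())
                    for x in range(len(lst) - 1)])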

And here is a more efficient approach using Pandas and apply():

def bigram(row):
    # Split the comma-separated string, strip stray whitespace,
    # and pair each word with its successor
    lst = row['words'].split(', ')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

# Applying row-wise keeps bigrams from crossing row boundaries
bigrams = df.apply(bigram, axis=1)

print(bigrams.tolist())
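If the index arithmetic feels clunky, an equivalent variant pairs each token with its successor via zip; splitting on ',' alone and stripping afterwards also handles entries like 'you , jim' (bigram_zip is just an illustrative name, not from the original answer):

def bigram_zip(row):
    # Split on commas, strip whitespace, then pair neighbouring tokens
    tokens = [w.strip() for w in row['words'].split(',')]
    return list(zip(tokens, tokens[1:]))

bigrams = df.apply(bigram_zip, axis=1)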
You can then group the data by category and find the 10 most frequent bigrams. Here's an example that builds the bigram frequency dictionaries by category:

from collections import Counter

df['bigrams'] = bigrams
# 'sum' concatenates the per-row bigram lists within each category
df2 = df.groupby('category').agg({'bigrams': 'sum'})

# Count the bigram frequencies by category
df3 = df2.bigrams.apply(lambda row: Counter(row)).to_frame()
print(df3)

                                                      bigrams
category                                                     
movie       {('hello', 'jim'): 2, ('jim', 'you'): 1, ('you...
television  {('how', 'are'): 1, ('are', 'you'): 1, ('you',...
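The question also asks for the 10 most frequent single words per category; the same Counter pattern should work there too (a sketch, with tokens as an illustrative intermediate column):

# Same idea for single words: split, strip, concatenate per category, count
df['tokens'] = df['words'].apply(lambda s: [w.strip() for w in s.split(',')])
word_counts = df.groupby('category')['tokens'].sum().apply(Counter)
print(word_counts.apply(lambda c: c.most_common(10)))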

How are you getting the bigrams at the moment? Could you post your bigram function again?

This seems to take a long time to run on a large dataframe. Is there any way to make your solution more efficient? Also, this solution only prints the bigrams; how can I format them into groups per category? Thanks.

There might be, depending on how you compute the bigrams. Does every row in the 'words:' column have exactly the same number of elements?

No, they have different word counts.

OK. If they had the same number of elements, I could recommend a more efficient approach. @jackiegirl89 I added a more efficient method using apply.
To keep only the top 3 most frequent bigrams per category (or 10, given enough data), use Counter.most_common, which sorts by count; slicing the raw Counter keys would only give insertion order:

# Keep the 3 most frequent bigrams per category
df3.bigrams.apply(lambda row: [bg for bg, _ in row.most_common(3)])
category
movie         [(hello, jim), (jim, you), (you, you)]
television      [(how, are), (are, you), (you, are)]
Name: bigrams, dtype: object
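As a side note, on pandas new enough to have DataFrame.explode (0.25+), the Counter step can be skipped entirely; a sketch using the bigrams column built above:

# Flatten the per-row bigram lists, then count within each category;
# groupby + value_counts sorts by frequency within each group
top10 = (df.explode('bigrams')
           .groupby('category')['bigrams']
           .value_counts()
           .groupby(level='category')
           .head(10))
print(top10)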