Python: how to group by and get each group's most frequent words and bigrams
I am currently working with a dataframe like this:
words                              other  category
hello, jim, you, you , jim         val1   movie
it, seems, bye, limb, pat, paddy   val2   movie
how, are, you, are , kim           val1   television
...
I am trying to compute the 10 most frequent words and bigrams for each category in the "category" column. That said, I want to compute the bigrams first and only then group them into their respective categories.
My problem is that if I group by category first and then get the top 10 most frequent words, the words of the first row get merged with those of the second row.
The bigrams should look like this:
(hello, jim), (jim, you), (you, you), (you, jim)
(it, seems), (seems, bye), (bye, limb), (limb, pat), (pat, paddy)
(how, are), (are, you), (you, are), (are, kim)
However, if I group before computing the bigrams, the bigrams will be:
(hello, jim), (jim, you), (you, you), (you, jim), (jim, it), (it, seems), (seems, bye), (bye, limb), (limb, pat), (pat, paddy)
(how, are), (are, you), (you, are), (are, kim)
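The difference can be shown in a tiny standalone sketch (plain Python, illustrative names): joining the two "movie" rows before pairing produces a spurious ('jim', 'it') bigram that spans the row boundary.

```python
# Minimal sketch of why grouping first creates spurious bigrams.
row1 = ['hello', 'jim', 'you', 'you', 'jim']
row2 = ['it', 'seems', 'bye', 'limb', 'pat', 'paddy']

def bigrams(tokens):
    # Pair each token with its successor within a single list
    return list(zip(tokens, tokens[1:]))

merged = bigrams(row1 + row2)             # group first, then pair
per_row = bigrams(row1) + bigrams(row2)   # pair first, then group

# The merged version contains one extra pair crossing the row boundary:
print(set(merged) - set(per_row))  # {('jim', 'it')}
```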
What is the best way to do this with pandas?
Sorry if my question is overly complicated; I just wanted to include all the details. Let me know if anything is unclear. Sample dataframe:
words other category
0 hello, jim, you, you , jim val1 movie
1 it, seems, bye, limb, pat, hello, jim val2 movie
2 how, are, you, are , kim val1 television
One way to compute the bigrams uses pandas with .iterrows(); a more efficient way uses .apply():
def bigram(row):
    # Split the comma-separated string and pair each word with its successor
    lst = row['words'].split(', ')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

bigrams = df.apply(bigram, axis=1)
print(bigrams.tolist())
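As a sanity check, here is the same .apply() approach as a self-contained sketch built from the question's sample data (assumes pandas is installed):

```python
import pandas as pd

# Sample data from the question; note the stray spaces that .strip() handles
df = pd.DataFrame({
    'words': ['hello, jim, you, you , jim',
              'it, seems, bye, limb, pat, paddy',
              'how, are, you, are , kim'],
    'other': ['val1', 'val2', 'val1'],
    'category': ['movie', 'movie', 'television'],
})

def bigram(row):
    lst = row['words'].split(', ')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

bigrams = df.apply(bigram, axis=1)
print(bigrams.tolist())
# First row: [('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')]
```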
You can then group the data by category and find the 10 most frequent bigrams. Here is an example of finding the most frequent bigrams per category:
df['bigrams'] = bigrams
df2 = df.groupby('category').agg({'bigrams': 'sum'})
# Compute the most frequent bigrams by category
from collections import Counter
df3 = df2.bigrams.apply(lambda row: Counter(row)).to_frame()
print(df3)
bigrams
category
movie {('hello', 'jim'): 2, ('jim', 'you'): 1, ('you...
television {('how', 'are'): 1, ('are', 'you'): 1, ('you',...
How are you currently getting the bigrams? Can you post the bigram function again? This seems to take a long time to run on a large dataframe. Is there any way to make your solution more efficient? Also, this solution only prints the bigrams; how can it be formatted into groups per category? Thanks.
Possibly, it depends on how you are computing the bigrams. Does every row of the 'words:' column have exactly the same number of elements?
No, their word counts differ.
OK. If they had the same number of elements, I could recommend a more efficient approach. @jackiegirl89 I added a more efficient approach using apply.
print(df3)
bigrams
category
movie {('hello', 'jim'): 2, ('jim', 'you'): 1, ('you...
television {('how', 'are'): 1, ('are', 'you'): 1, ('you',...
# Filter to just the top 3 most frequent bigrams (or 10 if you have enough data)
df3.bigrams.apply(lambda row: list(row)[0:3])
category
movie [(hello, jim), (jim, you), (you, you)]
television [(how, are), (are, you), (you, are)]
Name: bigrams, dtype: object
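One caveat with the snippet above: list(row)[0:3] takes the first three keys in insertion order, not the three highest counts. To rank by frequency, Counter.most_common() is the safer choice. A small sketch on a plain Counter (the same structure each df3 row holds):

```python
from collections import Counter

# One category's bigram counts, as produced by the grouping step above
c = Counter({('hello', 'jim'): 2, ('jim', 'you'): 1, ('you', 'you'): 1})

# most_common() sorts by count, highest first
top = [pair for pair, count in c.most_common(3)]
print(top)  # ('hello', 'jim') ranks first with count 2
```

Applied to the grouped frame, df3.bigrams.apply(lambda c: [p for p, _ in c.most_common(10)]) gives the true top 10 per category.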