Python: Find similar groups based on the intersection of values in a different column

I have a df that looks like this:

Group   Attribute
Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity

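What I want to do for each group is find the group it is most like, based on the intersection of their attributes. The final form I am after is:

Group   TotalCount   LikeGroup   CommonWords  PCT
Cheese  3            Yogurt      3            100.0
Cow     5            Cheese      2            40.0
Yogurt  4            Cheese      3            75.0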

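I realize this may be asking a lot in one question. I can do much of it myself, but I am really struggling to get the count of intersecting attributes, even between just one group and another. If I could find the intersection between Cheese and Yogurt, that would set me in the right direction.

Is it possible to do this inside the dataframe? I can see making several lists, intersecting every pair of lists, and then using the lengths of the resulting lists to get the percentages.

For example, for Yogurt: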

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cheese = ['Dairy', 'Food', 'Curd']
>>> Yogurt_Cheese = len(set(Yogurt) & set(Cheese)) / len(Yogurt)
>>> Yogurt_Cheese
0.75

>>> Cow = ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs']
>>> Yogurt_Cow = len(set(Yogurt) & set(Cow)) / len(Yogurt)
>>> Yogurt_Cow
0.5

>>> max(Yogurt_Cheese, Yogurt_Cow)
0.75
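The pairwise calculation above could be extended to every ordered pair of groups, starting from the dataframe itself. The following is only a minimal sketch: it assumes the question's data is rebuilt by hand as `df`, and the names `attr_sets` and `overlap` are made up for illustration and do not appear in the original post.

```python
from itertools import permutations

import pandas as pd

# The question's data, rebuilt by hand (assumption: no duplicate rows)
df = pd.DataFrame(
    {
        "Group": ["Cheese"] * 3 + ["Cow"] * 5 + ["Yogurt"] * 4,
        "Attribute": [
            "Dairy", "Food", "Curd",
            "Dairy", "Food", "Animal", "Hair", "Stomachs",
            "Dairy", "Food", "Curd", "Fruity",
        ],
    }
)

# One set of attributes per group, keyed by group name
attr_sets = df.groupby("Group")["Attribute"].apply(set).to_dict()

# For each ordered pair (a, b): the fraction of a's attributes that b shares
overlap = {
    (a, b): len(attr_sets[a] & attr_sets[b]) / len(attr_sets[a])
    for a, b in permutations(attr_sets, 2)
}

print(overlap[("Yogurt", "Cheese")])
print(overlap[("Yogurt", "Cow")])
```

The dictionary is keyed by ordered pairs because the ratio is not symmetric: Cheese shares all of its attributes with Yogurt, while Yogurt shares only three of its four with Cheese.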

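It looks like you should be able to work out an aggregation strategy to solve this. Try looking at these coding examples, and think about how to construct keys and aggregate functions over your dataframe instead of handling it piecemeal as shown in your example.

Try running this in your Python environment (it was created in a Jupyter notebook using Python 2.7) and see whether it gives you some ideas for your code: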

import numpy as np
import pandas as pd

np.random.seed(10)    # optional: makes sure you get the same random
                      # numbers used in the original experiment
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                   'key2':['one','two','one','two','one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df
group = df.groupby('key1')              # group on a single key
group2 = df.groupby(['key1', 'key2'])   # group on a composite key
group2.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])
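One way the "keys and aggregate functions" idea might carry over to the question's data is a self-merge on Attribute followed by a groupby on the pair of group names. This is a sketch under assumptions, not the answerer's method: the column name `Group_other` and the variable names `pairs` and `common` are invented here, and the data is assumed to contain no duplicate (Group, Attribute) rows.

```python
import pandas as pd

# The question's data, rebuilt by hand
df = pd.DataFrame(
    {
        "Group": ["Cheese"] * 3 + ["Cow"] * 5 + ["Yogurt"] * 4,
        "Attribute": [
            "Dairy", "Food", "Curd",
            "Dairy", "Food", "Animal", "Hair", "Stomachs",
            "Dairy", "Food", "Curd", "Fruity",
        ],
    }
)

# Pair every row with every row that has the same Attribute,
# then drop the pairs where a group is matched with itself
pairs = df.merge(df, on="Attribute", suffixes=("", "_other"))
pairs = pairs[pairs["Group"] != pairs["Group_other"]]

# Key: (Group, Group_other); aggregate: number of shared attributes
common = (
    pairs.groupby(["Group", "Group_other"])["Attribute"]
    .count()
    .rename("CommonWords")
    .reset_index()
)

print(common)
```

Pairs of groups that share no attribute at all would simply be absent from `common`, so a real dataset may need those rows filled in with zero.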

I created my own, smaller version of your example array.

import pandas as pd
from itertools import permutations

df = pd.DataFrame(
    data=[['cheese', 'dairy'], ['cheese', 'food'], ['cheese', 'curd'],
          ['cow', 'dairy'], ['cow', 'food'],
          ['yogurt', 'dairy'], ['yogurt', 'food'], ['yogurt', 'curd'], ['yogurt', 'fruity']],
    columns=['Group', 'Attribute'])

# Number of attributes per group (the TotalCount), used later.
# size() works on Python 2 and 3; the older count().to_dict() followed by
# count_dct.values()[0] only works on Python 2.
count_dct = df.groupby('Group').size().to_dict()

unique_grp = df['Group'].unique()      # the unique groups
unique_atr = df['Attribute'].unique()  # the unique attributes

# All ordered pairs of distinct groups, e.g. (cheese, cow) and (cow, cheese)
combos = list(permutations(unique_grp, 2))
comp_df = pd.DataFrame(data=combos, columns=['Group', 'LikeGroup'])  # table that holds the comparison data
comp_df['CommonWords'] = 0

for atr in unique_atr:
    # Rows that carry the attribute being looked at in this iteration
    temp_df = df[df['Attribute'] == atr]

    # Ordered pairs of groups that have this attribute in common
    myl = list(permutations(temp_df['Group'], 2))
    for comb in myl:
        # Increment CommonWords on the row whose Group is the first entry
        # of the pair and whose LikeGroup is the second entry
        mask = (comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1])
        comp_df.loc[mask, 'CommonWords'] += 1

# Put the previously computed TotalCount into the comparison dataframe
comp_df['TotalCount'] = comp_df['Group'].map(count_dct)

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round().astype(int)

For my example data, I got this output:

    Group LikeGroup  CommonWords  TotalCount  PCT
0  cheese       cow            2           3   67
1  cheese    yogurt            3           3  100
2     cow    cheese            2           2  100
3     cow    yogurt            2           2  100
4  yogurt    cheese            3           4   75
5  yogurt       cow            2           4   50

which seems to be correct.

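This shows the percentage of common words for every pair of groups, but I can easily take it from here, and I think this may be more useful than what I asked for. Thanks a lot.

No problem. In case anyone else has a similar problem, you should accept the answer ;)

For Python 3:

count_dct = count_dct.values()[0]

will give an error. Use instead

count_dct = df.groupby('Group').size()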
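For the final form asked for in the question (one row per group with its most similar group), the comparison table can be reduced to the best match per group. This is a sketch only: it rebuilds the answer's output table by hand as `comp_df`, the name `best` is invented for illustration, and ties are resolved by keeping the first row with the highest PCT, which is an assumption rather than something the thread specifies.

```python
import pandas as pd

# The comparison table from the answer's output, rebuilt by hand
comp_df = pd.DataFrame(
    {
        "Group": ["cheese", "cheese", "cow", "cow", "yogurt", "yogurt"],
        "LikeGroup": ["cow", "yogurt", "cheese", "yogurt", "cheese", "cow"],
        "CommonWords": [2, 3, 2, 2, 3, 2],
        "TotalCount": [3, 3, 2, 2, 4, 4],
        "PCT": [67, 100, 100, 100, 75, 50],
    }
)

# idxmax returns the index label of the first maximum within each group,
# so each Group keeps exactly one row: its highest-PCT LikeGroup
best_idx = comp_df.groupby("Group")["PCT"].idxmax()
best = comp_df.loc[best_idx].reset_index(drop=True)

print(best[["Group", "TotalCount", "LikeGroup", "CommonWords", "PCT"]])
```

In this smaller dataset cow matches cheese and yogurt equally well, so a different tie-breaking rule would be just as defensible.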