Complex cascading groupby in Python pandas


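I have a strange problem that I am trying to solve in pandas. Suppose I have a bunch of objects that can be grouped in different ways. Here is what our DataFrame looks like: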

import pandas as pd

df = pd.DataFrame([
    {'obj': 'Ball',    'group1_id': None, 'group2_id': '7' },
    {'obj': 'Balloon', 'group1_id': '92', 'group2_id': '7' },
    {'obj': 'Person',  'group1_id': '14', 'group2_id': '11'},
    {'obj': 'Bottle',  'group1_id': '3',  'group2_id': '7' },
    {'obj': 'Thought', 'group1_id': '3',  'group2_id': None},
])


obj       group1_id          group2_id
Ball      None               7
Balloon   92                 7
Person    14                 11
Bottle    3                  7
Thought   3                  None

I want to group things together based on either group. Here it is annotated:

obj       group1_id          group2_id    # annotated
Ball      None               7            #                   group2_id = 7
Balloon   92                 7            # group1_id = 92 OR group2_id = 7
Person    14                 11           # group1_id = 14 OR group2_id = 11
Bottle    3                  7            # group1_id =  3 OR group2_id = 7
Thought   3                  None         # group1_id = 3

Once combined, our output should look like this:

count         objs                               composite_id
4             [Ball, Balloon, Bottle, Thought]   g1=3,92|g2=7
1             [Person]                           g1=14|g2=11
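
Note that the first three objects can be found from group2_id=7 alone. The fourth, Thought, joins them because it shares group1_id=3 with Bottle, which already belongs to the group2_id=7 group.

Note: for this question, assume an item only ever belongs to one combined group; there will never be a case where it could fall into two groups.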


How can I do this in pandas?

This is not strange at all ~ it is a network problem.

import networkx as nx

# Handle missing values first: fill each missing id from the other column of
# the same row, so the row links to itself instead of landing in a wrong group.
df['key1'] = df['group1_id'].fillna(df['group2_id'])
df['key2'] = df['group2_id'].fillna(df['group1_id'])
# Build the network: one edge per row, between that row's two keys.
G = nx.from_pandas_edgelist(df, 'key1', 'key2')
l = list(nx.connected_components(G))
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]
d = {k: v for m in L for k, v in m.items()}
# Use the dict above to map every key to its component number, then group by it.
out = df.groupby(df.key1.map(d)).agg(
    objs=('obj', list),
    Count=('obj', 'count'),
    g1=('group1_id', lambda x: set(x[x.notnull()].tolist())),
    g2=('group2_id', lambda x: set(x[x.notnull()].tolist())),
)
# Note: I did not convert the composite id into a string. Keeping g1 and g2 as
# separate columns is easier to understand.
Out[53]: 
                                  objs  Count       g1    g2
key1                                                        
0     [Ball, Balloon, Bottle, Thought]      4  {92, 3}   {7}
1                             [Person]      1     {14}  {11}
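
The question asks for a single composite_id string such as g1=3,92|g2=7, while the output above keeps g1 and g2 as separate sets. A minimal sketch of joining the two sets into that string follows. The helper name composite_id is hypothetical, and the numeric sort assumes every id is a string of digits, as in the sample data.

```python
def composite_id(g1, g2):
    """Format two sets of id strings as 'g1=...|g2=...'.

    Assumption: every id is a string of digits, so sorting by int() is safe.
    """
    def part(name, ids):
        return f"{name}=" + ",".join(sorted(ids, key=int))
    return "|".join([part("g1", g1), part("g2", g2)])


# The two rows of the aggregated output shown above.
print(composite_id({"92", "3"}, {"7"}))
print(composite_id({"14"}, {"11"}))
```

With the aggregated frame from the answer, this could be applied row by row, for example with a list comprehension over zip(out['g1'], out['g2']).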

PS: If you need more details about the network step, check
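
To make the network step more concrete, here is a rough plain-Python sketch of what finding connected components means for the sample data. It is an illustration only, not how networkx implements it. The edge list is written out by hand from the key1/key2 columns built above, and the function name connected_components is reused here purely for readability.

```python
from collections import defaultdict, deque

# One (key1, key2) pair per row; a missing id was filled from the other
# column, so Ball becomes ('7', '7') and Thought becomes ('3', '3').
edges = [('7', '7'), ('92', '7'), ('14', '11'), ('3', '7'), ('3', '3')]


def connected_components(edges):
    """Return a list of sets, one per connected component (breadth-first search)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


components = connected_components(edges)
# Map every id to the index of its component, like the dict `d` in the answer.
label = {node: i for i, comp in enumerate(components) for node in comp}
print(components)
```

Ids 7, 92 and 3 end up in one component because 92 and 3 each share a row with 7, which is exactly the cascading behaviour the question describes.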

BEN_YO's answer is correct, but here is a more verbose solution in which I build a mapping from each key to the "first key" of its grouping set:

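from collections import defaultdict

import pandas as pd

# using four id fields instead of 2
grouping_fields = ['group1_id', 'group2_id', 'group3_id', 'group4_id']
id_fields = df.loc[df[grouping_fields].notnull().any(axis=1), grouping_fields]

# build a set of all similarly-grouped items
# and use the 'first seen' key as the grouping key for that set
FIRST_SEEN_TO_ALL = defaultdict(set)
KEY_TO_FIRST_SEEN = {}

for row in id_fields.to_dict('records'):
    # NaN is a truthy float, so a plain boolean check does not drop it;
    # test with pd.notna first, then drop other falsy values such as ''.
    keys = [key for key in row.values() if pd.notna(key) and key]
    # reuse the first-seen key of any id on this row that already has one,
    # so later rows cannot re-point an id to a different group
    seen = [KEY_TO_FIRST_SEEN[key] for key in keys if key in KEY_TO_FIRST_SEEN]
    row_id = seen[0] if seen else keys[0]
    # if this row links previously separate sets, fold them into row_id
    for other in set(seen) - {row_id}:
        for key in FIRST_SEEN_TO_ALL.pop(other):
            KEY_TO_FIRST_SEEN[key] = row_id
            FIRST_SEEN_TO_ALL[row_id].add(key)
    for key in keys:
        KEY_TO_FIRST_SEEN[key] = row_id
        FIRST_SEEN_TO_ALL[row_id].add(key)

def fetch_group_id(row):
    # return the first-seen key of the first id on the row that has one
    for key in row.to_dict().values():
        if pd.notna(key) and key in KEY_TO_FIRST_SEEN:
            return KEY_TO_FIRST_SEEN[key]

df['group_super'] = df[grouping_fields].apply(fetch_group_id, axis=1)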
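
The same "map every key to one representative" idea can also be written as a small disjoint-set (union-find) structure. The following is only a sketch under some assumptions: it uses the two-column sample frame from the question, the helper names find, union, row_keys and group_of are made up for illustration, and each id is prefixed with its column name so that an id in group1_id is never confused with an equal id in group2_id (a choice that goes beyond what either answer does).

```python
import pandas as pd

df = pd.DataFrame([
    {'obj': 'Ball',    'group1_id': None, 'group2_id': '7'},
    {'obj': 'Balloon', 'group1_id': '92', 'group2_id': '7'},
    {'obj': 'Person',  'group1_id': '14', 'group2_id': '11'},
    {'obj': 'Bottle',  'group1_id': '3',  'group2_id': '7'},
    {'obj': 'Thought', 'group1_id': '3',  'group2_id': None},
])
grouping_fields = ['group1_id', 'group2_id']

parent = {}  # disjoint-set forest: key -> parent key


def find(key):
    """Return the representative of key's set, shortening the path on the way."""
    parent.setdefault(key, key)
    while parent[key] != key:
        parent[key] = parent[parent[key]]
        key = parent[key]
    return key


def union(a, b):
    """Merge the sets that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


def row_keys(row):
    # Prefix each id with its column name, skipping missing values.
    return [f"{field}:{row[field]}" for field in grouping_fields
            if pd.notna(row[field])]


# Link all keys that appear together on the same row.
for _, row in df.iterrows():
    keys = row_keys(row)
    for key in keys:
        find(key)  # register the key even if it appears alone
    for other in keys[1:]:
        union(keys[0], other)


def group_of(row):
    keys = row_keys(row)
    return find(keys[0]) if keys else None


df['group_super'] = [group_of(row) for _, row in df.iterrows()]
groups = df.groupby('group_super')['obj'].agg(list)
print(groups)
```

Because find always walks to the current root, the result does not depend on the order in which rows are visited. Rows with no ids at all get None and are dropped by groupby.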

Wow, that is really impressive. How did you think of that so quickly? It is interesting to look at it from a graph perspective. …

@carl.hiass It is mostly experience ~

I also added a more verbose solution. Does mine look OK? Out of curiosity, do you have any suggestions on how to improve my pandas skills? How did you get so good?

@carl.hiass Your solution looks good ~ :-) Using pandas more on real problems will help you improve with pandas and other functional tools.