Python: How do I create an adjacency matrix from a long list of source/target pairs?

python pandas

Given the following data:

Class       Name
======      =============
Math        John Smith
-------------------------
Math        Jenny Simmons
-------------------------
English     Sarah Blume
-------------------------
English     John Smith
-------------------------
Chemistry   Roger Tisch
-------------------------
Chemistry   Jenny Simmons
-------------------------
Physics     Sarah Blume
-------------------------
Physics     Jenny Simmons

I have a list of classes and the name of a student in each class, like this:

[
{'class': 'Math', 'student': 'John Smith'},
{'class': 'Math', 'student': 'Jenny Simmons'},
{'class': 'English', 'student': 'Sarah Blume'},
{'class': 'English', 'student': 'John Smith'},
{'class': 'Chemistry', 'student': 'John Smith'},
{'class': 'Chemistry', 'student': 'Jenny Simmons'},
{'class': 'Physics', 'student': 'Sarah Blume'},
{'class': 'Physics', 'student': 'Jenny Simmons'},
]

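From this input I want to create an adjacency matrix with the following structure, showing the number of students each pair of classes has in common: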

How can I do this as efficiently as possible in Python/pandas? My list contains about 19 million of these class/student pairs (about 240 MB).

You can prepare the data for the adjacency matrix like this:

# df is assumed to be a DataFrame with the columns 'class' and 'student'
# create the "class tuples" by
# joining the dataframe with itself
df_cross = df.merge(df, on='student', suffixes=['_left', '_right'])
# remove the duplicate tuples
# --> keeping only class_left < class_right gives you a
# triangular matrix without entries on the diagonal
# if you would rather have a full matrix,
# just change the >= to == below
del_indexer = (df_cross['class_left'] >= df_cross['class_right'])
df_cross.drop(df_cross[del_indexer].index, inplace=True)
# create the counts / lists
groupby_obj = df_cross.groupby(['class_left', 'class_right'])
result = groupby_obj.count()
result.columns = ['value']
# if you want to have lists of the student names
# that have the course combination in
# common, you can do it with the following line,
# otherwise just remove it (I guess with a
# dataset of the size you mentioned, it will
# consume a lot of memory)
result['students'] = groupby_obj['student'].agg(list)

The full output looks like this:

Out[133]: 
                        value                     students
class_left class_right                                    
Chemistry  English          1                 [John Smith]
           Math             2  [John Smith, Jenny Simmons]
           Physics          1              [Jenny Simmons]
English    Math             1                 [John Smith]
           Physics          1                [Sarah Blume]
Math       Physics          1              [Jenny Simmons]

You can then pivot it using @piRSquared's method, or do it like this:

result['value'].unstack()

Out[137]: 
class_right  English  Math  Physics
class_left                         
Chemistry        1.0   2.0      1.0
English          NaN   1.0      1.0
Math             NaN   NaN      1.0

Or, if you also need the names:

result.unstack()
Out[138]: 
              value                   students                                              
class_right English Math Physics       English                         Math          Physics
class_left                                                                                  
Chemistry       1.0  2.0     1.0  [John Smith]  [John Smith, Jenny Simmons]  [Jenny Simmons]
English         NaN  1.0     1.0           NaN                 [John Smith]    [Sarah Blume]
Math            NaN  NaN     1.0           NaN                          NaN  [Jenny Simmons]
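The unstacked table above is triangular and not square: the Chemistry column and the Physics row are missing, and the empty cells are NaN. If a square, symmetric adjacency matrix is needed, one possible way to get there is sketched below. It is only a sketch on the sample data from the question; the names `records`, `tri`, `upper` and `adjacency` are made up for illustration, and it uses `size()` instead of `count()` to get the pair counts directly.

```python
import pandas as pd

# Sample data from the question (hypothetical variable name `records`).
records = [
    {'class': 'Math', 'student': 'John Smith'},
    {'class': 'Math', 'student': 'Jenny Simmons'},
    {'class': 'English', 'student': 'Sarah Blume'},
    {'class': 'English', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'Jenny Simmons'},
    {'class': 'Physics', 'student': 'Sarah Blume'},
    {'class': 'Physics', 'student': 'Jenny Simmons'},
]
df = pd.DataFrame(records)

# Self-join on student, then keep each unordered class pair only once.
df_cross = df.merge(df, on='student', suffixes=['_left', '_right'])
df_cross = df_cross[df_cross['class_left'] < df_cross['class_right']]

# Count the shared students per class pair and pivot to a triangular table.
tri = df_cross.groupby(['class_left', 'class_right']).size().unstack()

# Make the table square over all classes and replace NaN with 0.
classes = sorted(df['class'].unique())
upper = tri.reindex(index=classes, columns=classes).fillna(0).astype(int)

# Mirror the upper triangle; the diagonal stays 0.
adjacency = upper + upper.T
print(adjacency)
```

Filtering with `<` before grouping keeps the intermediate frame at roughly half the size of the full self-join, which may matter at 19 million rows.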

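See crosstab

@piRSquared: the linked post certainly covers part of the question, but not all of it, because he doesn't have the data in the structure your method is based on. So I'm not sure whether this can be marked as a duplicate.

@jottbe ok. I reopened it.

Very thorough. An excellent answer and a very clear explanation!

Thank you for the feedback. I'm glad I could help. Have you already tried it on the full dataset? How does it perform?

On the full dataset it takes just over 2 minutes, which is great!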
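The linked crosstab answer is not reproduced in this thread, so the following is only a guess at what a crosstab-based variant could look like on the sample data. The names `records`, `membership` and `adjacency` are made up for illustration. The idea is to build a student-by-class membership table and multiply its transpose by itself, which gives the number of shared students for every pair of classes.

```python
import numpy as np
import pandas as pd

# Sample data from the question (hypothetical variable name `records`).
records = [
    {'class': 'Math', 'student': 'John Smith'},
    {'class': 'Math', 'student': 'Jenny Simmons'},
    {'class': 'English', 'student': 'Sarah Blume'},
    {'class': 'English', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'Jenny Simmons'},
    {'class': 'Physics', 'student': 'Sarah Blume'},
    {'class': 'Physics', 'student': 'Jenny Simmons'},
]
df = pd.DataFrame(records)

# Rows are students, columns are classes, entries count the enrolments.
membership = pd.crosstab(df['student'], df['class'])
# Clip to 0/1 so a repeated (class, student) row is not counted twice.
membership = membership.clip(upper=1)

# (classes x students) times (students x classes) = shared students per pair.
adjacency = membership.T.dot(membership)

# The diagonal holds each class's own size; set it to 0 for an adjacency matrix.
values = adjacency.to_numpy(copy=True)
np.fill_diagonal(values, 0)
adjacency = pd.DataFrame(values, index=adjacency.index, columns=adjacency.columns)
print(adjacency)
```

Note that `pd.crosstab` builds a dense table with one row per student, so with 19 million pairs this may not fit in memory. In that case a sparse matrix or the self-join above is probably the safer choice.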