Finding the number of mutual friends in Python

python python-3.x pandas dataframe


I have a dataframe of users and their friends that looks like this:

user_id | friend_id
1         3
1         4
2         3
2         5
3         4

I want to write a function in Python that computes the number of mutual friends for each pair:

user_id | friend_id | num_mutual
1         3           1
1         4           1
2         3           0
2         5           0
3         4           1

Currently I have:

def find_mutual(df):
    num_mutual = []
    for i in range(len(df)):
        user, friend = df.loc[i, 'user_id'], df.loc[i, 'friend_id']
        user_list = df[df.user_id == user].friend_id.tolist() + df[df.friend_id == user].user_id.tolist()
        friend_list = df[df.user_id == friend].friend_id.tolist() + df[df.friend_id == friend].user_id.tolist()
        mutual = len(list(set(user_list) & set(friend_list)))
        num_mutual.append(mutual)
    return num_mutual

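It works for small datasets, but I am running it on a dataset with millions of rows, and it takes forever to finish. I know this is not the ideal way to find the count. Is there a better algorithm in Python? Thanks in advance.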

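An ugly idea is to construct a 4-point path that starts at a user_id and ends at the same user_id. If such a path exists, the two starting points have a mutual friend.
We start with: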
df
          user_id  friend_id
0        1          3
1        1          4
2        2          3
3        2          5
4        3          4

Then you can do:
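import pandas as pd
# Make the edge list symmetric so every friendship appears in both directions
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
dff = pd.concat([df, df.rename(columns={"user_id": "friend_id", "friend_id": "user_id"})])
# Pair up users that share a friend, dropping pairs of a user with itself
df_new = dff.merge(dff, on="friend_id", how="outer")
df_new = df_new[df_new["user_id_x"] != df_new["user_id_y"]]
# Keep only the pairs that are themselves friends, which closes the path
df_new = df_new.merge(dff, left_on="user_id_y", right_on="user_id")
df_new = df_new[df_new["user_id_x"] == df_new["friend_id_y"]]
# Attach to the original edges; count is 1 if a mutual friend exists, else 0
df_out = df.merge(df_new, left_on=["user_id", "friend_id"], right_on=["user_id_x", "friend_id_x"], how="left", suffixes=("__", "_"))
df_out["count"] = df_out["user_id_x"].notna().astype(int)
df_out[["user_id__", "friend_id", "count"]]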

   user_id__  friend_id  count
0          1          3      1
1          1          4      1
2          2          3      0
3          2          5      0
4          3          4      1
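
The merge chain above marks whether a mutual friend exists for each pair. If the actual number of mutual friends is needed, one possible variant is sketched here. It is an illustration and not part of the original answer; it assumes each friendship appears only once in df, and the names sym, common and num_mutual are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 1, 2, 2, 3], "friend_id": [3, 4, 3, 5, 4]})

# Symmetric edge list: every friendship listed in both directions
sym = pd.concat(
    [df, df.rename(columns={"user_id": "friend_id", "friend_id": "user_id"})],
    ignore_index=True,
)

# For every edge (user, friend), list each neighbor of user as a candidate
candidates = df.merge(sym.rename(columns={"friend_id": "common"}), on="user_id")

# Keep a candidate only if it is also a neighbor of friend
friend_side = sym.rename(columns={"user_id": "friend_id", "friend_id": "common"})
shared = candidates.merge(friend_side, on=["friend_id", "common"])

# Count surviving candidates per edge; edges with none get 0
counts = shared.groupby(["user_id", "friend_id"]).size().reset_index(name="num_mutual")
out = df.merge(counts, on=["user_id", "friend_id"], how="left")
out["num_mutual"] = out["num_mutual"].fillna(0).astype(int)
print(out)
```

The intermediate table holds one row per (edge, neighbor) combination, so memory use grows with the degree of each user; on very large graphs this may still be too heavy.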

A more elegant and direct approach uses a graph:
import networkx as nx
g = nx.from_pandas_edgelist(df, "user_id","friend_id")
nx.draw_networkx(g)


Then you can identify the number of mutual friends of 2 adjacent nodes (2 friends) as the number of 3-node paths between them:
from networkx.algorithms.simple_paths import all_simple_paths
for row in df.itertuples():
    # cutoff=2 limits the search to paths of at most 2 edges (3 nodes)
    paths = all_simple_paths(g, row[1], row[2], cutoff=2)
    df.at[row[0], "count"] = sum(len(p) == 3 for p in paths)
print(df)
   user_id  friend_id  count
0        1          3    1.0
1        1          4    1.0
2        2          3    0.0
3        2          5    0.0
4        3          4    1.0
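
As an alternative to enumerating paths, networkx also provides common_neighbors, which returns the nodes adjacent to both endpoints. The following is a minimal sketch under the same example data, not part of the original answer; the column name num_mutual is chosen here to match the question.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({"user_id": [1, 1, 2, 2, 3], "friend_id": [3, 4, 3, 5, 4]})
g = nx.from_pandas_edgelist(df, "user_id", "friend_id")

# For each edge, count the nodes that are adjacent to both endpoints
df["num_mutual"] = [
    len(list(nx.common_neighbors(g, u, v)))
    for u, v in zip(df["user_id"], df["friend_id"])
]
print(df)
```

This yields an integer column and does not walk any paths, so it should scale better than a path search on dense graphs, although building the graph still requires the whole edge list in memory.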

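For n friends, you are effectively building an n^2 table. Regardless of the algorithm, that is computationally expensive. I think you really have two different questions. First, is there a better algorithm for this problem that needs less memory than an n^2 table and runs in close to O(n) time? Second, is there a Python library that can be used to implement that algorithm? I have no ready answer to either, but you might consider using dynamic programming techniques to break the problem into smaller pieces.

Thinking about it further: you can treat your dataframe as a graph edge list and approach the problem from there.

Thank you for the comments and suggestions! Much appreciated. This is exactly what I was looking for.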
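
The edge-list view raised in the comments can be sketched in plain Python: build each user's set of neighbors once, then intersect the two sets for every pair. This is an illustrative sketch, not something proposed verbatim in the thread; count_mutual is a hypothetical helper name, and edges is assumed to be a list of (user_id, friend_id) tuples.

```python
from collections import defaultdict


def count_mutual(edges):
    """Return the number of common neighbors for each (user, friend) edge."""
    neighbors = defaultdict(set)
    # One pass to record every friendship in both directions
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # One set intersection per edge instead of re-scanning the whole table
    return [len(neighbors[u] & neighbors[v]) for u, v in edges]


edges = [(1, 3), (1, 4), (2, 3), (2, 5), (3, 4)]
result = count_mutual(edges)
print(result)  # [1, 1, 0, 0, 1]
```

With a DataFrame, the edge list could be obtained with list(zip(df.user_id, df.friend_id)). The cost per pair then depends on the size of the smaller neighbor set, not on the number of rows in the table.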