Python: self-join against all rows with a different value and apply an aggregate function (Python, Pandas)

python pandas

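I have a Python dataframe that consists of the columns

paper, author, col_1, col_2, ..., col_100

Dtypes:
Paper type: string (unique)
Author type: string
col_x: float

I understand that what I am trying to do is complex and performance-heavy, but my solution takes ages to finish.

For every row in the dataframe, I want to self-join with all the rows whose author is different from the author of that row. Then I want to apply a function to the col_x values of that row and of each joined row, and get some aggregated results.

My solution uses iterrows, which I know is the slowest option, but I cannot think of any other way: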
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from statistics import mean

papers = ... #is my dataframe
cols = ['min_col', 'avg_col', 'max_col', 'label']
all_cols = ['col_1', 'col_2', ..., 'col_100']

df_result = pd.DataFrame({}, columns = cols)

for ind, paper in papers.iterrows():

    col_vector = paper[all_cols].values.reshape(1,-1) #bring the columns in the correct format   
    temp = papers[papers.author != paper.author].author.unique() #get all authors that are not the same with the one in the row
    for auth in temp:
        temp_papers = papers[papers.author == auth]  #get all papers of that author 
        if temp_papers.shape[0] > 1: #if I have more than 1 paper find the cosine_similarity of the row and the joined rows
            res = []
            for t_ind, t_paper in temp_papers.iterrows():
                res.append(cosine_similarity(col_vector, t_paper[all_cols].values.reshape(1,-1))[0][0])

            df_result = df_result.append(pd.DataFrame([[min(res), mean(res), max(res), 0]], columns = cols), ignore_index = True)

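Edit 2:
I also tried a cross join of the dataframe with itself, followed by excluding the rows that have the same author. However, when I do that I get the same error on several lines:
papers['key'] = 0
cross = papers.merge(papers, on = 'key', how = 'outer')
>> [IPKernelApp] WARNING | No such comm: 3a1ea2fa71f711ea847aacde48001122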

Extra info

The dataframe has 45k rows
There are roughly 5k unique authors
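
For scale, here is a minimal sketch of the cross-join idea from Edit 2 on a made-up four-row frame (the values are invented for illustration; only the column names follow the question). It assumes pandas 1.2 or later, where `merge` accepts `how="cross"`, so the helper `key` column is not needed. The last lines only count rows at the size given above; whatever caused the kernel warning, a frame of that size is unlikely to fit in memory on an ordinary machine.

```python
import pandas as pd

# Toy frame; column names follow the question, the values are made up.
toy = pd.DataFrame({
    "paper": ["p1", "p2", "p3", "p4"],
    "author": ["a", "a", "b", "c"],
    "col_1": [1.0, 2.0, 3.0, 4.0],
})

# how="cross" builds the Cartesian product directly (pandas >= 1.2).
cross = toy.merge(toy, how="cross", suffixes=("_left", "_right"))

# Keep only the pairs written by different authors.
pairs = cross[cross["author_left"] != cross["author_right"]]

print(len(cross))  # 16 = 4 * 4
print(len(pairs))  # 10 = 16 - (2*2 + 1 + 1) same-author pairs

# Row count of the full cross join at the size mentioned in the question.
n_rows = 45_000
print(f"{n_rows * n_rows:,}")  # 2,025,000,000
```

With about two billion rows and more than 200 columns after the join, materialising the cross join is not a practical route at 45k rows.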
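First, if your dataframe is not too big (in your case it seems that it is), you can do this by using the vectorization of cosine_similarity. To do so, you first need a mask for all the authors that have more than one row, then create a dataframe with enough information in the index and the columns to be able to group by, and finally query the rows you want: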
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# dummy data for illustration
np.random.seed(10)
papers = pd.DataFrame({'author': list('aabbcdddae'), 
                       'col_1': np.random.randint(30, size=10), 
                       'col_2': np.random.randint(20, size=10), 
                       'col_3': np.random.randint(10, size=10),})
all_cols = ['col_1', 'col_2','col_3']

First solution:
#mask author with more than 1 row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)

# use cosine_similarity with all the rows at a time
# compared to all the rows with authors with more than a row
df_f = (pd.DataFrame(cosine_similarity(papers.loc[:,all_cols],papers.loc[mask_author,all_cols]), 
                     # create index and columns to keep some info about authors
                     index=pd.MultiIndex.from_frame(papers['author'].reset_index(), 
                                                    names=['index_ori', 'author_ori']), 
                     columns=papers.loc[mask_author,'author'])
          # put all columns as rows to be able to perform a groupby all index levels and agg
          .stack()
          .groupby(level=[0, 1, 2]).agg(['min', 'mean', 'max'])
          # remove rows that compared authors with themself
          .query('author_ori != author')
          # add label column with 0, not sure why
          .assign(label=0)
          # reset index as you don't seem to care
          .reset_index(drop=True))
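
To make the "all rows at a time" point concrete, the sketch below checks on made-up random matrices (the names `X`, `Y` and the sizes are illustrative assumptions) that a single `cosine_similarity` call returns the same numbers as calling it pair by pair, as the question's inner loop does, and the same numbers as a plain NumPy product of row-normalised matrices.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((4, 3))   # 4 "papers", 3 feature columns (made-up values)
Y = rng.random((5, 3))   # 5 "papers" to compare against

# One call: entry [i, j] is the cosine similarity of X[i] and Y[j].
sim = cosine_similarity(X, Y)
print(sim.shape)  # (4, 5)

# Pair-by-pair version, in the style of the question's inner loop.
sim_loop = np.array([
    [cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))[0][0] for y in Y]
    for x in X
])

# Plain NumPy equivalent: normalise each row, then take a matrix product.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
sim_np = Xn @ Yn.T

print(np.allclose(sim, sim_loop), np.allclose(sim, sim_np))
```

The NumPy form assumes no row is all zeros, since such a row would be divided by a zero norm.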

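The problem now is that with 45K rows and 5K authors, I doubt that an ordinary computer can handle the previous method. The idea is then to perform the same operation, but one group of rows (one author) at a time: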
# mask for authors with more than a row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)
# instead of doing it for each iteration, save the df with authors with more than a row
papers_gt1 = papers.loc[mask_author, :]

# compared to your method, it is more efficient to save the dataframes in a list
# and concat once at the end than to use append on a dataframe at each iteration
res = []
# iterate over each author
for auth, dfg in papers[all_cols].groupby(papers['author']):
    # mask to remove the current author from the comparison df
    mask_auth = papers_gt1['author'].ne(auth)
    # append a dataframe built on the same idea as in the first solution,
    # with a small difference: dfg and papers_gt1.loc[mask_auth, all_cols]
    # already contain different authors, so no query is needed afterwards
    res.append(pd.DataFrame(cosine_similarity(dfg, papers_gt1.loc[mask_auth, all_cols]), 
                            columns=papers_gt1.loc[mask_auth, 'author'])
                 .stack()
                 .groupby(level=[0, 1]).agg(['min', 'mean', 'max']))
# outside of the loop, concat everything and add the label column
df_f = pd.concat(res, ignore_index=True).assign(label=0)
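
As a possible variation on the same idea (not part of the answer above), the sketch below avoids `stack` and `groupby` by sorting the comparison rows by author and reducing contiguous column blocks with NumPy's `ufunc.reduceat`. Everything here is an assumption made for illustration: the helper `aggregate_chunk`, the `chunk_size` value, and the toy data, which uses `randint(1, ...)` so that no row is all zeros. Rows are normalised by hand, so a dot product stands in for `cosine_similarity`.

```python
import numpy as np
import pandas as pd

# Toy data (made up); low=1 avoids all-zero rows, which cannot be normalised.
np.random.seed(10)
papers = pd.DataFrame({'author': list('aabbcdddae'),
                       'col_1': np.random.randint(1, 30, size=10),
                       'col_2': np.random.randint(1, 20, size=10),
                       'col_3': np.random.randint(1, 10, size=10)})
all_cols = ['col_1', 'col_2', 'col_3']

# Unit-length rows: a dot product of two rows is their cosine similarity.
feats = papers[all_cols].to_numpy(dtype=float)
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
authors = papers['author'].to_numpy()

# Comparison side: positions of papers whose author has more than one paper,
# reordered so that each author's papers sit next to each other.
multi = papers.groupby('author')['author'].transform('count').gt(1).to_numpy()
ref_pos = np.flatnonzero(multi)
codes, group_names = pd.factorize(authors[ref_pos], sort=True)
ref_pos = ref_pos[np.argsort(codes, kind='stable')]
sizes = np.bincount(codes)                            # papers per author
starts = np.concatenate(([0], np.cumsum(sizes)[:-1])) # first column of each block
ref = feats[ref_pos]

def aggregate_chunk(row_idx):
    """min/mean/max similarity of each row in row_idx to every other author."""
    sim = feats[row_idx] @ ref.T                      # shape (chunk, n_ref)
    n_groups = len(group_names)
    out = pd.DataFrame({
        'index_ori': np.repeat(row_idx, n_groups),
        'author_ori': np.repeat(authors[row_idx], n_groups),
        'author': np.tile(group_names, len(row_idx)),
        'min': np.minimum.reduceat(sim, starts, axis=1).ravel(),
        'mean': (np.add.reduceat(sim, starts, axis=1) / sizes).ravel(),
        'max': np.maximum.reduceat(sim, starts, axis=1).ravel(),
    })
    # drop comparisons of an author with their own papers
    return out[out['author_ori'] != out['author']]

# Process the rows in chunks so the similarity matrix stays small.
chunk_size = 4
chunks = [aggregate_chunk(np.arange(i, min(i + chunk_size, len(papers))))
          for i in range(0, len(papers), chunk_size)]
result = pd.concat(chunks, ignore_index=True).assign(label=0)
print(result.head())
```

Only one `chunk_size` by `n_ref` block of similarities is alive at a time, so memory is bounded by the chunk size rather than by the full 45K by 45K matrix; the result itself still has one row per (paper, other author) pair.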

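Note: the whole operation will still take a long time, but your code was losing efficiency at several levels. If you want to keep iterrows, here are a few points that would make your code more efficient:

- As you mentioned, iterrows is not recommended, and two iterrows plus another loop is really slow.
- The second iterrows does not take advantage of the fact that cosine_similarity is vectorized for input arrays with several rows.
- Doing temp = papers[papers.author != paper.author].author.unique() at every iteration wastes a lot of time. You could create the list of unique authors once before the loop, and then inside the loop just check that the current paper.author is different from auth (using your notation).
- Same idea: the if temp_papers.shape[0] > 1 check could be done beforehand for each auth. I assume the number of papers does not change, so if you create the list of unique auth outside the loop (previous point), it can already exclude the authors with only one paper.
- Finally, using append on a dataframe in each loop is a huge loss of time (see the timing comparison), so it is better to create another list res_agg, do res_agg.append([min(res), mean(res), max(res), 0]) in the loop, and after all the loops build df_result = pd.DataFrame(res_agg, columns=cols).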
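
Put together, the points above could look like the following sketch. It keeps the outer iterrows of the question but applies each suggestion; the names `features_by_author` and `res_agg` and the toy data are illustrative assumptions, not code from the question.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same dummy data as in the answer above.
np.random.seed(10)
papers = pd.DataFrame({'author': list('aabbcdddae'),
                       'col_1': np.random.randint(30, size=10),
                       'col_2': np.random.randint(20, size=10),
                       'col_3': np.random.randint(10, size=10)})
all_cols = ['col_1', 'col_2', 'col_3']
cols = ['min_col', 'avg_col', 'max_col', 'label']

# Done once, before the loops: the authors with more than one paper,
# together with the feature matrix of their papers.
counts = papers['author'].value_counts()
multi_authors = counts.index[counts > 1]
features_by_author = {
    auth: papers.loc[papers['author'] == auth, all_cols].to_numpy(dtype=float)
    for auth in multi_authors
}

res_agg = []  # plain list of rows; the dataframe is built once at the end
for ind, paper in papers.iterrows():
    col_vector = paper[all_cols].to_numpy(dtype=float).reshape(1, -1)
    for auth, feats in features_by_author.items():
        if auth == paper['author']:
            continue  # skip the author of the current row
        # one vectorised call replaces the inner iterrows
        res = cosine_similarity(col_vector, feats)[0]
        res_agg.append([res.min(), res.mean(), res.max(), 0])

df_result = pd.DataFrame(res_agg, columns=cols)
print(df_result.shape)  # (22, 4)
```

The row order differs from the question's loop, because the authors are visited in `value_counts` order rather than in order of appearance.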
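If your operation can be done with numpy functions, have you already tried that? See ubuntu's answer:

@emiljoj No, I have not tried it. Sometimes it is easier for me to think about a problem in a more linear way, with simple iteration. But if there is a way to do it with vectors, I am happy to try :)

@Tasos Could you add some input, like 10-20 rows with a few authors and a couple of col_x columns, so that the code can be tested and the output checked? Also, how many unique authors and how many rows in total does your dataframe have? The answer may depend on these two numbers ;)

@Ben.T Added the information about the size. Will try to get a sample in a postable format here

@Tasos So if I understand your code well, in df_result you will have, for each row of the original papers, one row per other author. So that is about 45K * 5K, roughly 225 million rows, right?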
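
The estimate in the last comment can be checked with a few lines of arithmetic. The row and author counts come from the question; the byte counts assume three float64 values per result row and ignore pandas overhead, so they are rough lower bounds rather than measurements.

```python
n_rows = 45_000     # papers, from the question
n_authors = 5_000   # approximate number of unique authors, from the question

# Upper bound on the result: one row per (paper, other author) pair.
# Authors with a single paper are skipped, so the real count is lower.
max_result_rows = n_rows * (n_authors - 1)
print(f"{max_result_rows:,}")              # 224,955,000

# min, mean and max stored as float64 (8 bytes each).
result_bytes = max_result_rows * 3 * 8
print(f"{result_bytes / 1e9:.1f} GB")      # 5.4 GB

# Full 45K x 45K similarity matrix in float64, as in the first solution.
full_matrix_bytes = n_rows * n_rows * 8
print(f"{full_matrix_bytes / 1e9:.1f} GB")  # 16.2 GB
```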