scikit-learn: Pairwise operations with a different filter condition on each pair

I have the following two DataFrames, say df1

    a   b   c   d
0   0   1   2   3
1   4   0   0   7
2   8   9  10  11
3   0   0   0  15

and df2

    a   b   c   d
0   5   1   2   3

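I want to perform a pairwise operation between each row of df1 and the single row of df2. However, if a column in a df1 row is 0, that column is excluded from both the df1 row and the df2 row for that operation. Each pairwise operation therefore works on a pair of rows of a different length. Let me break it down into 4 comparisons.

Comparison 1

0 1 2 3 vs 5 1 2 3 is done on 1 2 3 vs 1 2 3, because column a is 0

Comparison 2

4 0 0 7 vs 5 1 2 3 is done on 4 7 vs 5 3, because 2 columns have to be dropped

Comparison 3

8 9 10 11 vs 5 1 2 3 is done on 8 9 10 11 vs 5 1 2 3, because no columns are dropped

Comparison 4

0 0 0 15 vs 5 1 2 3 is done on 15 vs 3, because all columns but one are dropped

The result of each pairwise operation is a scalar, so the overall result is some structure (a list, an array, a DataFrame, or anything else) with 4 values, or as many values as there are rows in df1. I should also note that the values in df2 play no part in the filtering: nothing is filtered based on the value of any column in df2.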

For simplicity, you can loop over each row of the DataFrame and do the following:

import pandas as pd
import numpy as np

a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])

# loop over each row in 'a'
for i in range(len(a)):
    # find indices of the non-zero elements of the row
    non_zero = np.nonzero(a.iloc[i].to_numpy())[0]

    # perform pair-wise addition between the non-zero elements in 'a' and the same elements in 'b'
    print(a.iloc[i].to_numpy()[non_zero] + b.iloc[0].to_numpy()[non_zero])

Here I used pairwise addition, but you can replace the addition with an operation of your choice.
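Since the question asks for one scalar per pair, the addition can be swapped for a reducing operation. The following is a minimal sketch, assuming Euclidean distance as the pairwise operation (the question does not name one). The helper names `masked_pairwise` and `euclidean` are made up for illustration and are not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd


def masked_pairwise(df, row, op):
    """Apply op to each row of df paired with the single row `row`,
    skipping the columns where the df row is 0 (hypothetical helper)."""
    ref = row.to_numpy().ravel()
    results = []
    for values in df.to_numpy():
        keep = values != 0  # columns that take part in this pair
        results.append(op(values[keep], ref[keep]))
    return pd.Series(results, index=df.index)


def euclidean(x, y):
    # assumed example operation; any function returning a scalar works
    return float(np.sqrt(np.sum((x - y) ** 2)))


df1 = pd.DataFrame([[0, 1, 2, 3], [4, 0, 0, 7], [8, 9, 10, 11], [0, 0, 0, 15]],
                   columns=list("abcd"))
df2 = pd.DataFrame([[5, 1, 2, 3]], columns=list("abcd"))

distances = masked_pairwise(df1, df2, euclidean)
print(distances)
```

The result has one value per row of df1, which matches the structure described in the question.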

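Edit: If the DataFrames are large, we may want to vectorize this to avoid the loop. Here is one idea: convert the zeros to NaN so that they are ignored in row-wise operations: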

import pandas as pd
import numpy as np

a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])

# find indices of zeros
zeros = (a==0).values

# set zeros to nan
a[zeros] = np.nan

# tile and reshape 'b' so it's the same shape as 'a'
b = pd.DataFrame(np.tile(b, len(a)).reshape(np.shape(a)), columns=b.columns)
# set the zero indices to nan
b[zeros] = np.nan

print('a:')
print(a)

print('b:')
print(b)

# now do some row-wise operation. For example, take the sum of each row (NaN values are skipped)
print('sum:')
print(np.sum(a+b, axis=1))

Output:

a:
     a    b     c   d
0  NaN  1.0   2.0   3
1  4.0  NaN   NaN   7
2  8.0  9.0  10.0  11
3  NaN  NaN   NaN  15
b:
     a    b    c  d
0  NaN  1.0  2.0  3
1  5.0  NaN  NaN  3
2  5.0  1.0  2.0  3
3  NaN  NaN  NaN  3
sum:
0    12.0
1    19.0
2    49.0
3    18.0
dtype: float64
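The NaN approach works because the row-wise sum skips NaN values. For operations that do not skip NaN on their own, one alternative is to keep a boolean mask and zero out the excluded terms before reducing. The sketch below assumes Euclidean distance as the example operation and relies on NumPy broadcasting instead of tiling; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 3], [4, 0, 0, 7], [8, 9, 10, 11], [0, 0, 0, 15]],
                   columns=list("abcd"))
df2 = pd.DataFrame([[5, 1, 2, 3]], columns=list("abcd"))

A = df1.to_numpy(dtype=float)
B = df2.to_numpy(dtype=float)  # shape (1, 4); broadcasts against every row of A

keep = A != 0  # True where the df1 value takes part in the operation

# squared differences, with excluded columns contributing nothing to the sum
sq_diff = np.where(keep, (A - B) ** 2, 0.0)

distances = pd.Series(np.sqrt(sq_diff.sum(axis=1)), index=df1.index)
print(distances)
```

Zeroing the excluded terms is only valid when 0 is neutral for the reduction (as it is for a sum); a product or a mean would need a different fill value or an explicit count of the kept columns.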

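This certainly works, thank you. I will study the two lines of code inside the for loop so that I understand them. I tried it and got the answer I expected. I will upvote, but I have to ask: is this approach still reasonable at scale? What if there were 10,000 rows instead of 4? In my earlier attempts I tried to avoid a for loop, but if it is necessary, it is necessary.

We can vectorize the operation with NumPy. Give me a moment to do that.

I edited the answer to show an idea that avoids the loop.

Thanks. I probably won't be back until Monday, but I will let you know how the timing turns out. The original for-loop approach was painful, but it could be done, so hopefully this will be what I am looking for.

Yes, I noticed the difference in speed. It is good, though not worth writing up, since it is what you would expect when going from a for loop to vectorization. Checking that all the values are correct is a bit harder (I did an argsort to get the index of the maximum), but the results are plausible, so I think this general approach does work for me.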
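The check described in the comments can be sketched roughly as follows: compute the masked row sums with a loop and with a vectorized expression, confirm that they agree, and use argsort to find the row with the largest value. The random data, its size, the seed, and the variable names are all assumptions for illustration; they are not the poster's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(10_000, 4))  # random integers 0..4, so zeros occur
b = np.array([5, 1, 2, 3])

# loop version: one masked sum per row
loop_sums = np.array([(row[row != 0] + b[row != 0]).sum() for row in A])

# vectorized version: excluded columns contribute 0 to the row sum
keep = A != 0
vec_sums = np.where(keep, A + b, 0).sum(axis=1)

# both approaches should give identical results
print(np.array_equal(loop_sums, vec_sums))

# argsort orders rows from smallest to largest sum; the last entry is a maximum
order = np.argsort(vec_sums)
largest = order[-1]
print(largest, vec_sums[largest])
```

Timing the two versions (for example with the standard `timeit` module) on data of this size is one way to confirm the speed difference mentioned above.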