Python: sort the rows of an array using an identifier column to match the order of another array

python performance numpy

I have two arrays like this:

A = [[111, ...],          B = [[222, ...],
     [222, ...],               [111, ...],
     [333, ...],               [333, ...],
     [555, ...]]               [444, ...],
                               [555, ...]]

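The first column contains identifiers and the remaining columns contain data; B has far more columns than A. The identifiers are unique. A may have fewer rows than B, so in some cases empty filler rows are needed.
I am looking for an efficient way to match the rows of matrix A to the rows of matrix B, so that the result looks like this: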

A = [[222, ...],
     [111, ...],
     [333, ...],
     [nan, nan], #could be any unused value
     [555, ...]]

I could sort both matrices or write a for loop, but both approaches seem clumsy... Is there a better implementation?

A simple approach is to build a dict from A, and then use it to map the identifiers found in B to a new array.

Building the dict:

>>> A = [[1,"a"], [2,"b"], [3,"c"]]
>>> A_dict = {x[0]: x for x in A}
>>> A_dict
{1: [1, 'a'], 2: [2, 'b'], 3: [3, 'c']}

Mapping:

>>> B = [[3,"..."], [2,"..."], [1,"..."]]
>>> result = (A_dict[x[0]] for x in B)
>>> list(result)
[[3, 'c'], [2, 'b'], [1, 'a']]
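
The lookup above raises a KeyError when B contains an identifier that A lacks (444 in the question). Here is a minimal sketch of one way to handle that, assuming a placeholder row of None values is acceptable. The names fill_row and aligned are illustrative and not part of the original answer.

```python
A = [[111, "a"], [222, "b"], [333, "c"], [555, "e"]]
B = [[222, "..."], [111, "..."], [333, "..."], [444, "..."], [555, "..."]]

# identifier -> full row of A
A_dict = {row[0]: row for row in A}

# Placeholder for identifiers of B that are missing in A
# (an assumption; any unused value would do).
fill_row = [None] * len(A[0])

# Walk B in order; dict.get falls back to the placeholder instead of raising.
aligned = [A_dict.get(row[0], fill_row) for row in B]
print(aligned)
```

Using dict.get keeps the result the same length as B, which is what the question asks for.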

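It is not clear whether you want the values from B concatenated onto A. Let's assume not... Then the simplest approach is probably to build a dictionary keyed by identifier and reorder A: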

import numpy

def match_order(A, B):
    # identifier -> row
    by_id = {A[i, 0]: A[i] for i in range(len(A))}

    # make up a fill row and rearrange according to B
    fill_row = [-1] * A.shape[1]
    return numpy.array([by_id.get(k, fill_row) for k in B[:, 0]])

For example, if we have:

A = numpy.array([[111, 1], [222, 2], [333, 3], [555, 5]])
B = numpy.array([[222, 2], [111, 1], [333, 3], [444, 4], [555, 5]])

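Then match_order(A, B) returns the rows of A in the order of B's identifiers, with a fill row of -1 wherever B has an identifier that A lacks.

If you do want to concatenate B's data columns, just do: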

>>> numpy.hstack( (match_order(A, B), B[:, 1:]) )
array([[222,   2,   2],
       [111,   1,   1],
       [333,   3,   3],
       [ -1,  -1,   4],
       [555,   5,   5]])
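
The question asks for nan as the filler, which an integer array cannot hold. Here is a minimal sketch of a variant that casts A to float first. The name match_order_nan is hypothetical, and the sketch assumes the identifiers are integers.

```python
import numpy

def match_order_nan(A, B):
    # Cast to float so that NaN is representable in the output.
    A_float = A.astype(float)

    # identifier -> row; keys are plain ints (assumes integer identifiers)
    by_id = {int(A[i, 0]): A_float[i] for i in range(len(A))}

    # NaN fill row for identifiers of B that are missing in A
    fill_row = numpy.full(A.shape[1], numpy.nan)
    return numpy.array([by_id.get(int(k), fill_row) for k in B[:, 0]])

A = numpy.array([[111, 1], [222, 2], [333, 3], [555, 5]])
B = numpy.array([[222, 2], [111, 1], [333, 3], [444, 4], [555, 5]])
out = match_order_nan(A, B)
print(out)
```

The price of the NaN filler is that the whole result becomes a float array, identifiers included.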

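Here A[0], B[1] and A[1], B[0] are the same. Converting to a dict and working with that makes life easier here.

Step 1: Create a dict object for each 2D list.

Step 2: Iterate over each key in A_dict and check: a. whether the key exists in B_dict; b. if so, whether both keys have the same value.

Step 3: Append the key and value to form a 2D list.

Cheers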

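Here is an approach using np.searchsorted -

Note that np.in1d could also be used, as np.in1d(B[:,0], A[:,0]), to create valid_mask for a more intuitive answer. However, we are using np.searchsorted because it performs better, which is also discussed in more detail elsewhere.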
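
To illustrate the membership-test alternative mentioned above, here is a minimal sketch that builds the mask both ways and checks that they agree. It uses np.isin, the newer spelling of np.in1d, which assumes NumPy 1.13 or later. The arrays are the ones from the sample run, and the names mask_isin and mask_ss are illustrative.

```python
import numpy as np

A = np.array([[45, 11, 86], [18, 74, 59], [30, 68, 13], [55, 47, 78]])
B = np.array([[45, 11, 88], [55, 83, 46], [95, 87, 77],
              [30, 9, 37], [14, 97, 98], [18, 48, 53]])

# Membership test: True where B's identifier also occurs in A's first column
mask_isin = np.isin(B[:, 0], A[:, 0])

# searchsorted-based mask: the left and right insertion points differ
# only when the identifier is actually present in A
sidx = A[:, 0].argsort()
left = np.searchsorted(A[:, 0], B[:, 0], sorter=sidx)
right = np.searchsorted(A[:, 0], B[:, 0], sorter=sidx, side='right')
mask_ss = left != right

print(mask_isin)
print(np.array_equal(mask_isin, mask_ss))
```

The membership test only replaces the mask; the left insertion points are still needed to pick the matching rows out of A.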
Sample run -
In [184]: A
Out[184]: 
array([[45, 11, 86],
       [18, 74, 59],
       [30, 68, 13],
       [55, 47, 78]])

In [185]: B
Out[185]: 
array([[45, 11, 88],
       [55, 83, 46],
       [95, 87, 77],
       [30,  9, 37],
       [14, 97, 98],
       [18, 48, 53]])

In [186]: out
Out[186]: 
array([[ 45.,  11.,  86.],
       [ 55.,  47.,  78.],
       [ nan,  nan,  nan],
       [ 30.,  68.,  13.],
       [ nan,  nan,  nan],
       [ 18.,  74.,  59.]])

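What do you want to do with the remaining columns? Should the remaining columns of B be appended to the columns of A? Can you assume that A and B have the same number of rows and the same set of identifiers? Are the identifiers unique?

I think you forgot to add more details. Do you mean matching identical lists found in the 2D matrices A and B (e.g. same list length and same data in the lists)? Given that B has far more columns than A, I wonder how they could match exactly…

Important points, I have added these to the question, thanks!

Also, will A always be sorted, or does it just happen to be sorted in the question?

This is all numpy, so it feels like it should be faster than my suggestion... but it doesn't appear to be (at least on this small array): %timeit vectorized(A, B) => 9.6 µs, %timeit match_order(A, B) => 6.99 µs. Is that because of those two?

That's better... once we make the arrays larger (e.g. 1000 rows), the vectorized version gets faster.

@donkopotamus That's because we are doing setup work to handle it. That is usually the case with vectorized approaches, since they expect the arrays to be of a decent size.

@donkopotamus Also, if A[:,0] is already sorted, we can save a good amount of runtime by avoiding the creation and use of sidx.

@Divakar This is great, I could not have come up with the np.searchsorted method! Thanks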
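
The remark about an already sorted A[:,0] can be sketched as follows. This is a minimal illustration, assuming A's identifier column is in ascending order, in which case np.searchsorted needs no sorter argument. The names ids, left and right are illustrative.

```python
import numpy as np

# A's identifier column is already ascending (an assumption of this sketch)
A = np.array([[111, 1], [222, 2], [333, 3], [555, 5]])
B = np.array([[222, 2], [111, 1], [333, 3], [444, 4], [555, 5]])

ids = A[:, 0]
left = np.searchsorted(ids, B[:, 0])                 # no argsort, no sorter
right = np.searchsorted(ids, B[:, 0], side='right')
valid_mask = left != right                           # True where B's id is in A

# NaN-filled output; matched rows are copied straight from A
out = np.full((B.shape[0], A.shape[1]), np.nan)
out[valid_mask] = A[left[valid_mask]]
print(out)
```

If A is not sorted, this shortcut gives wrong results, so it is only safe when the ordering is guaranteed.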

Code for the dict-based steps (Step 1 to Step 3):

>>> A = [[3,'d', 'e', 'f'], [1,'a','b','c'], [2,'n','n','n']]
>>> B = [[1,'a','b','c'], [3,'d','e','f']]
>>> A_dict = {x[0]:x[1:] for x in A}
>>> A_dict
    {1: ['a', 'b', 'c'], 2: ['n', 'n', 'n'], 3: ['d', 'e', 'f']}
>>> B_dict = {x[0]:x[1:] for x in B}
>>> B_dict
    {1: ['a', 'b', 'c'], 3: ['d', 'e', 'f']} 
>>> result=[[x] + A_dict[x] for x in A_dict if x in B_dict and A_dict[x]==B_dict[x]]
>>> result
    [[1, 'a', 'b', 'c'], [3, 'd', 'e', 'f']]
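
On Python 3.7 and later, dicts keep insertion order, so iterating over A_dict visits the identifiers in A's order (3, 1, 2 here) rather than in B's order. Here is a minimal sketch that iterates over B instead, so the result follows B. The name result_in_b_order is illustrative.

```python
A = [[3, 'd', 'e', 'f'], [1, 'a', 'b', 'c'], [2, 'n', 'n', 'n']]
B = [[1, 'a', 'b', 'c'], [3, 'd', 'e', 'f']]

# identifier -> data columns of A
A_dict = {row[0]: row[1:] for row in A}

# Walk B in order; keep a row only if its identifier exists in A
# and the data columns are identical.
result_in_b_order = [
    [row[0]] + A_dict[row[0]]
    for row in B
    if row[0] in A_dict and A_dict[row[0]] == row[1:]
]
print(result_in_b_order)
```

Only one dict is needed here, because B is read directly as a list.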

Code for the np.searchsorted approach:

import numpy as np

# Store the sorted indices of A
sidx = A[:,0].argsort()

# Find the indices of col-0 of B in col-0 of sorted A
l_idx = np.searchsorted(A[:,0],B[:,0],sorter = sidx)

# Create a mask corresponding to all those indices that indicates which indices
# corresponding to B's col-0 match up with A's col-0
valid_mask = l_idx != np.searchsorted(A[:,0],B[:,0],sorter = sidx,side='right')

# Initialize output array with NaNs. 
# Use l_idx to set rows from A into output array. Use valid_mask to select 
# indices from l_idx and output rows that are to be set.
out = np.full((B.shape[0],A.shape[1]),np.nan)
out[valid_mask] = A[sidx[l_idx[valid_mask]]]
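
For convenience, the steps above can be wrapped in a function and run end to end. This is a minimal sketch; the function name match_rows is hypothetical, and the arrays are the ones from the sample run.

```python
import numpy as np

def match_rows(A, B):
    """Return the rows of A reordered to follow B's identifier column.

    Rows of B whose identifier is missing from A become NaN rows.
    """
    sidx = A[:, 0].argsort()
    l_idx = np.searchsorted(A[:, 0], B[:, 0], sorter=sidx)
    r_idx = np.searchsorted(A[:, 0], B[:, 0], sorter=sidx, side='right')
    valid_mask = l_idx != r_idx

    out = np.full((B.shape[0], A.shape[1]), np.nan)
    out[valid_mask] = A[sidx[l_idx[valid_mask]]]
    return out

A = np.array([[45, 11, 86], [18, 74, 59], [30, 68, 13], [55, 47, 78]])
B = np.array([[45, 11, 88], [55, 83, 46], [95, 87, 77],
              [30, 9, 37], [14, 97, 98], [18, 48, 53]])
out = match_rows(A, B)
print(out)
```

Because the output is initialized with np.nan, it is a float array even when A holds integers.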
