Python: find a set of indices that maps the rows of one NumPy array to another


I have two structured 2D numpy arrays that are equal in principle, meaning

A = numpy.array([[a1,b1,c1],
                 [a2,b2,c2],
                 [a3,b3,c3],
                 [a4,b4,c4]]) 

B = numpy.array([[a2,b2,c2],
                 [a4,b4,c4],
                 [a3,b3,c3],
                 [a1,b1,c1]])
Not in the sense of

numpy.array_equal(A,B) # False
numpy.array_equiv(A,B) # False
numpy.equal(A,B) # ndarray of True and False
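but in the sense that one array (A) is the original and in the other one (B) the data is shuffled along one axis (it could be along the rows or the columns).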
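To make the distinction concrete, here is a minimal sketch of an order-insensitive comparison. The helper name `same_rows` is hypothetical (it does not appear in the question), and the sketch only covers the case where whole rows are shuffled. It sorts the rows of both arrays lexically with `np.lexsort` and compares the results.

```python
import numpy as np

def same_rows(A, B):
    """Return True if A and B contain the same rows, ignoring row order."""
    if A.shape != B.shape:
        return False
    # np.lexsort treats the LAST key as the primary one, so the columns are
    # reversed to sort by column 0 first, then column 1, and so on.
    order_a = np.lexsort(A.T[::-1])
    order_b = np.lexsort(B.T[::-1])
    return np.array_equal(A[order_a], B[order_b])

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
B = A[[1, 3, 2, 0]]            # same rows, shuffled

print(np.array_equal(A, B))    # False: element-wise comparison
print(same_rows(A, B))         # True: same rows, different order
```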

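What is an efficient way to sort/shuffle B so that it matches or becomes equal to A, or alternatively to sort A so that it becomes equal to B? The equality check itself is not important, as long as both arrays are shuffled to match each other. A, and hence B, have unique rows.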

I tried the view method to sort both arrays, like so:

def sort2d(A):
    A_view = np.ascontiguousarray(A).view(np.dtype((np.void,
             A.dtype.itemsize * A.shape[1])))
    A_view.sort()
    return A_view.view(A.dtype).reshape(-1,A.shape[1])   

but that obviously doesn't work. This operation needs to be performed on really large arrays, so performance and scalability are critical.
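One likely contributor to the failed attempt (an observation, not a full diagnosis): the row view has shape (n, 1), and `ndarray.sort()` sorts along the last axis by default, so each one-element row is "sorted" on its own and nothing moves. The sketch below uses a plain integer column as a stand-in for the void view.

```python
import numpy as np

col = np.array([[3], [1], [2]])   # shape (3, 1), like the row view in sort2d

last_axis = col.copy()
last_axis.sort()                  # default axis=-1: each row has a single element
print(last_axis.ravel())          # [3 1 2], unchanged

first_axis = col.copy()
first_axis.sort(axis=0)           # sort down the rows instead
print(first_axis.ravel())         # [1 2 3]
```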

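Based on your example, it seems that you have shuffled all of the columns simultaneously, so that there is a single vector of row indices that maps A→B. Here's a toy example: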

import numpy as np

A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]

print(repr(A))
# array([[ 7, 11,  6],
#        [ 4, 10,  8],
#        [ 9,  2,  0],
#        [ 1,  3,  5]])

print(repr(B))
# array([[ 1,  3,  5],
#        [ 4, 10,  8],
#        [ 7, 11,  6],
#        [ 9,  2,  0]])
We want to recover a set of indices, idx, such that A[idx] == B. This mapping is unique if and only if A and B contain no repeated rows.
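Since uniqueness of the mapping depends on there being no repeated rows, it may be worth checking that precondition first. A minimal sketch, assuming `np.unique` with `axis=0` is available (the helper name `has_unique_rows` is hypothetical):

```python
import numpy as np

def has_unique_rows(X):
    """Return True if no row of the 2D array X appears more than once."""
    # np.unique(..., axis=0) collapses duplicate rows into one
    return np.unique(X, axis=0).shape[0] == X.shape[0]

A = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
dup = np.vstack([A, A[:1]])       # append a copy of the first row

print(has_unique_rows(A))         # True
print(has_unique_rows(dup))       # False
```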


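One efficient* approach is to find the indices that would lexically sort the rows of A, and then find where each row of B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays, using an np.void dtype that treats each row as a single element: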

# bytes per row; the size must be an integer (true division would give a float)
rowtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()

a_to_as = np.argsort(a)     # indices that sort the rows of A in lexical order
Now we can perform a binary search to find where each row of B would fall within the sorted version of A.

The mapping from A→B can then be expressed as the composition A→As→B, where As is the row-sorted version of A.
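The code for the binary-search and composition steps is not preserved here, so what follows is only a sketch of how they might look. It assumes that `np.searchsorted` with its `sorter` argument is the binary search meant, and that sorting and searching work on an unstructured `np.void` view in your NumPy version. The names `as_to_b` and `a_to_b` are chosen to match the surrounding snippets.

```python
import numpy as np

A = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
B = A[[3, 1, 0, 2]]

# View each row as one opaque element (the raw bytes of the row)
rowtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()

a_to_as = np.argsort(a)                           # A -> As (sorted rows of A)
as_to_b = np.searchsorted(a, b, sorter=a_to_as)   # As -> B via binary search
a_to_b = a_to_as[as_to_b]                         # composition A -> As -> B

print(np.all(A[a_to_b] == B))
```

The byte-wise order of the void view need not match numeric row order, but that does not matter here: only a consistent ordering of A is needed for the lookup.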

If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using

b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
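The reason `np.argsort` yields the inverse mapping is that argsort of a permutation is its inverse permutation. A small self-contained illustration (the arrays here are made up for the example):

```python
import numpy as np

perm = np.array([2, 0, 3, 1])          # a permutation of 0..3
inv = np.argsort(perm)                 # its inverse permutation

print(inv)                             # [1 3 0 2]
print(perm[inv])                       # [0 1 2 3]
print(inv[perm])                       # [0 1 2 3]

# Applied to rows: if B = A[perm], then B[inv] recovers A
A = np.arange(12).reshape(4, 3)
B = A[perm]
print(np.array_equal(B[inv], A))       # True
```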
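These steps can be wrapped in a single function (called find_row_mapping in the benchmark further down) and benchmarked.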
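The original single function is not reproduced here. As a stand-in, this is a sketch of a variant that swaps in a different technique: it sorts both arrays with `np.lexsort` rather than sorting one and binary-searching the other. The name `find_row_mapping_lexsort` is hypothetical, and the function assumes that B is a row permutation of A with no repeated rows.

```python
import numpy as np

def find_row_mapping_lexsort(A, B):
    """Return idx such that A[idx] == B (B must be a row permutation of A)."""
    sort_a = np.lexsort(A.T[::-1])   # indices that sort the rows of A
    sort_b = np.lexsort(B.T[::-1])   # indices that sort the rows of B
    # Row sort_b[k] of B equals row sort_a[k] of A, so scatter accordingly
    idx = np.empty(A.shape[0], dtype=np.intp)
    idx[sort_b] = sort_a
    return idx

rng = np.random.default_rng(0)
A = rng.permutation(12).reshape(4, 3)
B = A[rng.permutation(4)]

idx = find_row_mapping_lexsort(A, B)
print(np.array_equal(A[idx], B))     # True
```

This avoids the void view entirely, at the price of two sorts instead of one sort plus a binary search.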

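*The most expensive step is the quicksort over the rows, which is O(n log n) on average. I'm not sure whether it's possible to do any better than that.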

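Since either of the two arrays could be shuffled to match the other, nothing stops us from rearranging both. Using the unique-rows approach from Jaime's answer, we can vstack the two arrays and find the unique rows. The inverse indices returned by unique are then essentially the desired mapping (since the arrays contain no duplicate rows).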
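The idea can be sketched compactly with `np.unique(..., axis=0)`, which is an assumption on my part: it is used here in place of the void-view helper defined in this answer. The names `label_a` and `label_b` are hypothetical.

```python
import numpy as np

A = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
B = A[[3, 1, 0, 2]]
n = A.shape[0]

# Label every row of the stacked array by its rank among the unique rows
_, inv = np.unique(np.vstack([A, B]), axis=0, return_inverse=True)
inv = inv.reshape(-1)            # guard: the inverse's shape has varied across NumPy versions
label_a, label_b = inv[:n], inv[n:]

# Matching rows share a label, so sorting by label lines the two arrays up
idx_a, idx_b = np.argsort(label_a), np.argsort(label_b)
print(np.array_equal(A[idx_a], B[idx_b]))   # True
```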

For convenience, let's first define a unique2d function:

def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False): 
    """Get unique values along an axis for 2D arrays.

        input:
            arr:
                2D array
            consider_sort:
                Does permutation of the values within the axis matter? 
                Two rows can contain the same values but with 
                different arrangements. If consider_sort 
                is True then those rows would be considered equal
            return_index:
                Similar to numpy unique
            return_inverse:
                Similar to numpy unique
        returns:
            2D array of unique rows
            If return_index is True also returns indices
            If return_inverse is True also returns the inverse array 
            """

    if consider_sort:
        a = np.sort(arr, axis=1)
    else:
        a = arr
    # View each row as a single void element; ravel so that np.unique
    # receives a 1D array and returns 1D index arrays
    b = np.ascontiguousarray(a).view(np.dtype((np.void,
            a.dtype.itemsize * a.shape[1]))).ravel()

    if return_inverse:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)
    else:
        _, idx = np.unique(b, return_index=True)

    if return_index and return_inverse:
        return arr[idx], idx, inv
    elif return_index:
        return arr[idx], idx
    elif return_inverse:
        return arr[idx], inv
    else:
        return arr[idx]
We can now define the mapping as follows:

def row_mapper(a,b,consider_sort=False):
    """Given two 2D numpy arrays returns mappers idx_a and idx_b 
        such that a[idx_a] = b[idx_b] """

    assert a.dtype == b.dtype
    assert a.shape == b.shape

    c = np.concatenate((a,b))
    _, inv = unique2d(c, consider_sort=consider_sort, return_inverse=True)
    mapper_a = inv[:b.shape[0]]
    mapper_b = inv[b.shape[0]:]

    return np.argsort(mapper_a), np.argsort(mapper_b) 
Verification:

n = 100000
A = np.arange(n).reshape(n//4,4)
B = A[::-1,:]

idx_a, idx_b  = row_mapper(A,B)
print(np.all(A[idx_a] == B[idx_b]))
# True
Benchmark, measured against @ali_m's solution:

%timeit find_row_mapping(A,B) # ali_m's solution
%timeit row_mapper(A,B) # current solution

# n = 100
100000 loops, best of 3: 12.2 µs per loop
10000 loops, best of 3: 47.3 µs per loop

# n = 1000
10000 loops, best of 3: 49.1 µs per loop
10000 loops, best of 3: 148 µs per loop

# n = 10000
1000 loops, best of 3: 548 µs per loop
1000 loops, best of 3: 1.6 ms per loop

# n = 100000
100 loops, best of 3: 6.96 ms per loop
100 loops, best of 3: 19.3 ms per loop

# n = 1000000
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 372 ms per loop

# n = 10000000
1 loops, best of 3: 2.54 s per loop
1 loops, best of 3: 5.92 s per loop

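There may be room for improvement, but the current solution is 2-3 times slower than ali_m's solution and perhaps a little messier, and both arrays need to be mapped. I just thought this could be an alternative solution.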
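If only one of the arrays should be reindexed, the two index vectors can be collapsed into a single A→B mapping. A minimal sketch, in which `idx_a` and `idx_b` are stand-ins produced with `np.lexsort` rather than by `row_mapper` (any pair satisfying `A[idx_a] == B[idx_b]` would do):

```python
import numpy as np

A = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
B = A[[3, 1, 0, 2]]

# Stand-ins for the output of row_mapper: any pair with A[idx_a] == B[idx_b]
idx_a = np.lexsort(A.T[::-1])
idx_b = np.lexsort(B.T[::-1])
assert np.array_equal(A[idx_a], B[idx_b])

# Undo idx_b on both sides: B == A[idx_a][argsort(idx_b)]
a_to_b = idx_a[np.argsort(idx_b)]
print(np.array_equal(A[a_to_b], B))   # True
```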

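Comments:

What's wrong with B[:] = A?

I can't do that, because the rows of A and B are further mapped to arrays C and D respectively, and the (row) order of A and B determines the values in C and D, so I can't mess with the order.

I had the same idea, but are you sure this approach works for any pair of arrays? On my laptop I ran your code twice in a row. By the way, I think you mean idx rather than perm in the last line.

Sorry, I messed up my forward and inverse mappings. It should work now.

No worries. I tried a different approach using Jaime's answer, finding the unique rows of the two arrays vstacked together (posted below), although it doesn't look as elegant as your solution.

Regarding np.unique: the performance difference probably comes down to the fact that you are sorting both arrays rather than just one, along with some extra copying of the input.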