(Python) Find the indices of values in two arrays that are equal to the values in two other arrays


python arrays numpy


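I have the following 4 arrays, and I want to get the indices of the values that are equal in arrays A and X, where the values at the same positions in B and Y are equal as well. So for the following example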

    import numpy as np
    A = np.asarray([400.5, 100,  700,   200,  15, 900])
    B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
    X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
    Y = np.asarray([500.5, 500,600.5,   100,   8, 555, 999])

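I would like to get two arrays of indices:

indAB=[0 2 4 5]

0 because 400.5 and 500.5 at position 0 of A & B also appear together in X & Y
2 because 700 and 500 at position 2 of A & B also appear together in X & Y
4 because 15 and 8 at position 4 of A & B also appear together in X & Y
5 because 900 and 999 at position 5 of A & B also appear together in X & Y

indXY=[0 1 4 6]

0, 1, 4 and 6 are found in the same way as for indAB, but with respect to X & Y

Here indAB holds the indices of the values in A and B that are equal to values in X and Y, and indXY holds the indices of the values in X and Y that are equal to values in A and B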

This is my attempt so far:

    def indices(a,b):
        setb = set(b)
        ind = [i for i, x in enumerate(a) if x in setb]
        return ind

    iA = np.asarray(indices(A,X))
    iB = np.asarray(indices(X,A))
    iX = np.asarray(indices(B,Y))
    iY = np.asarray(indices(Y,B))

    def CommonIndices(a,b):
        return np.asarray(list(set(a) & set(b)))

    indAB = CommonIndices(iA,iX)
    indXY = CommonIndices(iB,iY)

    print(indAB) # returns = [0 2 4 5]
    print(indXY) # returns = [0 1 2 4 6]

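For indXY I keep getting [0 1 2 4 6], which is incorrect. 2 should not be included: even though 600.5 appears in both Y and B, the values paired with it (200 in A and 100 in X) are not equal

I would be very grateful if anyone could suggest a solution. Many thanks

Try this: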

import numpy as np

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500,600.5,   100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

eq = AB[:, np.newaxis, :] == XY[np.newaxis, :, :]
eq = np.logical_and.reduce(eq, axis=-1)

indAB, = np.where(np.logical_or.reduce(eq, axis=1))
indXY, = np.where(np.logical_or.reduce(eq, axis=0))

print("indAB", indAB)
print("indXY", indXY)

Output:

indAB [0 2 4 5]
indXY [0 1 4 6]

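Explanation

AB and XY are just the arrays A, B and X, Y respectively "stacked" into two-dimensional arrays. eq holds all the comparisons between the elements of AB and XY; np.newaxis is used to add a dimension to AB and to XY (note that AB gets the new dimension in position 1 and XY in position 0). The equality operator == broadcasts the arrays through their new dimensions. The first np.logical_and.reduce operation makes sure that both "components" are equal (A to X and B to Y), and the np.logical_or.reduce operations check whether there is any full equality from AB to XY and from XY to AB. Finally, np.where gets the indices.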
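To make the shapes in the explanation concrete, here is a minimal sketch on made-up toy data (the arrays below are illustrative, not the ones from the question). It uses .all and .any, which should give the same result as the logical_and.reduce and logical_or.reduce calls above.

```python
import numpy as np

# Toy data, made up for illustration: 3 pairs in AB, 2 pairs in XY.
AB = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
XY = np.array([[3.0, 30.0], [1.0, 99.0]])

# (3, 1, 2) == (1, 2, 2) broadcasts to (3, 2, 2):
# every row of AB is compared against every row of XY, component by component.
eq = AB[:, np.newaxis, :] == XY[np.newaxis, :, :]
print(eq.shape)  # (3, 2, 2)

# A pair only matches if both components match, hence AND over the last axis.
pair_eq = eq.all(axis=-1)
print(pair_eq.shape)  # (3, 2)

# Rows of AB that match some row of XY, and rows of XY that match some row of AB.
ind_ab, = np.where(pair_eq.any(axis=1))
ind_xy, = np.where(pair_eq.any(axis=0))
print(ind_ab, ind_xy)  # [2] [0]
```

Note that (1.0, 10.0) and (1.0, 99.0) share their first component but are not reported, which is exactly the case the original attempt in the question got wrong.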

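As a downside, note that this requires a boolean array of size len(A) x len(X) x 2, so you may run into memory problems if the original arrays are very large.

Update

As mentioned, very large arrays can be a problem. If you want to make all the comparisons "in one go", there is really no way around it (the size of the intermediate array is simply the number of comparisons). However, you can also run the algorithm "by chunks", for example: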

import numpy as np

MAX_SIZE = 2  # Biggest array will be MAX_SIZE x MAX_SIZE x 2

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500,600.5,   100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

maskAB = np.full(len(AB), False, dtype=bool)
maskXY = np.full(len(XY), False, dtype=bool)

for iAB in range(0, len(AB), MAX_SIZE):
    pAB = np.expand_dims(AB[iAB:iAB + MAX_SIZE], axis=1)
    for iXY in range(0, len(XY), MAX_SIZE):
        pXY = np.expand_dims(XY[iXY:iXY + MAX_SIZE], axis=0)
        eq = pAB == pXY
        eq = np.logical_and.reduce(eq, axis=-1)
        maskAB[iAB:iAB + MAX_SIZE] |= np.logical_or.reduce(eq, axis=1)
        maskXY[iXY:iXY + MAX_SIZE] |= np.logical_or.reduce(eq, axis=0)

indAB, = np.where(maskAB)
indXY, = np.where(maskXY)

print("indAB", indAB)
print("indXY", indXY)

The output is still:

indAB [0 2 4 5]
indXY [0 1 4 6]

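I used a MAX_SIZE value of 2 just to show that it works on the example, but in practice you can choose it according to the maximum amount of memory you are willing to use (for example, with MAX_SIZE = 10000 the intermediate array should be on the order of hundreds of megabytes). MAX_SIZE does not need to be smaller than the size of the arrays, nor does it need to be a divisor of their size.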
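As a rough guide for picking MAX_SIZE, the following sketch estimates the size of one boolean comparison block, assuming NumPy stores one byte per boolean. The helper names chunk_bytes and max_size_for_budget are hypothetical, and the estimate ignores the smaller temporaries created by the reductions.

```python
import math
import numpy as np

def chunk_bytes(max_size, n_components=2):
    """Bytes taken by one boolean block of shape (max_size, max_size, n_components)."""
    return max_size * max_size * n_components * np.dtype(bool).itemsize

def max_size_for_budget(budget_bytes, n_components=2):
    """Largest MAX_SIZE whose comparison block fits in budget_bytes (hypothetical helper)."""
    per_cell = n_components * np.dtype(bool).itemsize
    return math.isqrt(budget_bytes // per_cell)

print(chunk_bytes(10000))                # 200000000, i.e. about 200 MB
print(max_size_for_budget(200_000_000))  # 10000
```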

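The numpy_indexed package (disclaimer: I am its author) contains functionality to do this kind of thing efficiently and elegantly. The memory requirements are linear, and the computational requirements of this method are O(NlogN). For the substantial arrays you are considering, the speed advantage over the currently accepted brute-force method could easily be several orders of magnitude: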

import numpy as np
import numpy_indexed as npi

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500,600.5,   100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

# casting the AB and XY arrays to npi.index first is not required, but a performance optimization; without this each call to npi.indices would have to re-index the arrays, which is the expensive part
AB = npi.as_index(AB)
XY = npi.as_index(XY)
# npi.indices(list, items) is a vectorized nd-equivalent of list.index(item)
indAB = npi.indices(AB, XY, missing='mask').compressed()
indXY = npi.indices(XY, AB, missing='mask').compressed()

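Note that you also get to choose how missing values are handled. Also take a look at the set operations, such as npi.intersection(XY, AB); they might provide a simpler route to what you are trying to achieve at a higher level.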
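For readers who cannot install the package, a comparable sort-based approach can be sketched in plain NumPy. This is an illustrative stand-in, not how numpy_indexed is implemented: it labels every distinct (first, second) pair with an integer via np.unique and then tests membership on those labels. It assumes exact floating-point equality is what you want.

```python
import numpy as np

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500,600.5,   100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

# Give every distinct row an integer id; np.unique sorts the rows to do so,
# so memory stays linear in len(AB) + len(XY).
both = np.concatenate([AB, XY], axis=0)
_, ids = np.unique(both, axis=0, return_inverse=True)
ids = ids.ravel()  # guard: some NumPy versions return the inverse with an extra axis
idAB, idXY = ids[:len(AB)], ids[len(AB):]

# A row of AB matches if its id also occurs among the XY ids, and vice versa.
indAB, = np.where(np.isin(idAB, idXY))
indXY, = np.where(np.isin(idXY, idAB))

print("indAB", indAB)  # indAB [0 2 4 5]
print("indXY", indXY)  # indXY [0 1 4 6]
```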

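Here is an alternative. I dare say it is relatively clear, it should be efficient thanks to the use of sets, and it only requires O(len(A) + len(X)) memory. numpy is not even needed, but it can be used for the arrays.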

from collections import defaultdict

A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

def get_indices(values):
    d = defaultdict(set)
    for i, value in enumerate(values):
        d[value].add(i)
    return d

iA, iB, iX, iY = [get_indices(values) for values in [A, B, X, Y]]
print(iA)
# {400.5: {0}, 100: {1}, 200: {3}, 900: {5}, 700: {2}, 15: {4}}
print(iX)
# {400.5: {0}, 100: {2}, 300: {3}, 900: {6}, 555: {5}, 700: {1}, 15: {4}}

for i, (a, b) in enumerate(zip(A, B)):
    common_indices = iX[a] & iY[b]
    if common_indices:
        print("A B : %d" % i)
        print("X Y : %d" % common_indices.pop())
        print()

#   A B : 0
#   X Y : 0

#   A B : 2
#   X Y : 1

#   A B : 4
#   X Y : 4

#   A B : 5
#   X Y : 6
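The loop above prints each match. If you want the two index lists from the question instead, a variant keyed on (value, value) tuples might look like the sketch below. The function name matching_pair_indices is made up for illustration, and it relies on exact equality of the values.

```python
def matching_pair_indices(A, B, X, Y):
    """Return (indAB, indXY) for pairs (A[i], B[i]) that equal some pair (X[j], Y[j])."""
    # Map each (x, y) pair to every position where it occurs in X/Y.
    positions = {}
    for j, pair in enumerate(zip(X, Y)):
        positions.setdefault(pair, []).append(j)

    ind_ab, ind_xy = [], set()
    for i, pair in enumerate(zip(A, B)):
        hits = positions.get(pair)
        if hits:
            ind_ab.append(i)
            ind_xy.update(hits)
    return ind_ab, sorted(ind_xy)

A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

print(matching_pair_indices(A, B, X, Y))
# ([0, 2, 4, 5], [0, 1, 4, 6])
```

Keying on the whole pair avoids intersecting two separate index sets, and duplicate pairs in X/Y are all reported.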

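Thank you so much!! This is exactly what I needed, and I would never have gotten there by myself, haha. Thanks :D

@TimeExplorer No problem. I added some explanation in case you (or anyone else who finds the answer) finds it useful.

The arrays are indeed quite large, 35182*2044207*2 = 143838581348. Is there a less expensive way?

@TimeExplorer I added a variant of the code that runs the algorithm "by pieces", which should let you limit the amount of memory used.

Nice. It does look more efficient than the accepted answer. I don't know how you implemented npi. As far as I can tell, my answer is related to yours, but with plain Python objects.

npi is written in "pure numpy", so the trick to performing set-type operations is to (arg)sort the arrays and group related items together. Hence the O(NlogN) performance, rather than the O(N) your approach should have. But in many real-world cases sorting turns out better than O(NlogN), since data is rarely completely random; and of course vectorization is hard to beat.

The numpy_indexed package works very well. Thanks a lot :D

@EelcoHoogendoorn: Thank you very much for your answer. It is always nice to get first-hand experience from the author.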