Intersection between two multi-dimensional arrays with tolerance - NumPy/Python

python numpy

I am stuck on a problem. I have two 2D numpy arrays filled with x and y coordinates. The arrays might look like this:

array1([[(1.22, 5.64)],
   [(2.31, 7.63)],
   [(4.94, 4.15)]],

array2([[(1.23, 5.63)],
   [(6.31, 10.63)],
   [(2.32, 7.65)]],

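Now I have to find the "duplicate nodes". However, nodes must also count as equal when their coordinates agree within a given tolerance, so I cannot use solutions that require exact equality. Since my arrays are quite large (about 200,000 rows each), two simple for loops are not an option either. My final output should look like this: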

output([[(1.23, 5.63)],
   [(2.32, 7.65)]],

I would be grateful for some hints.

Cheers,

To compare nodes within a given tolerance, I recommend using numpy.isclose(), which lets you set both a relative and an absolute tolerance:

import numpy

numpy.isclose(1.24, 1.25, atol=1e-1)
# True
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# array([ True,  True])
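For reference, numpy.isclose treats two values as close when |a - b| <= atol + rtol * |b|. The short sketch below recomputes that test by hand; the variable names are only illustrative.

```python
import numpy as np

a, b = 1.24, 1.25
atol, rtol = 1e-1, 1e-5  # rtol=1e-5 is numpy's default; atol defaults to 1e-8

# Manual version of the test performed by np.isclose
manual = abs(a - b) <= atol + rtol * abs(b)

print(manual, np.isclose(a, b, rtol=rtol, atol=atol))
# With the default tolerances the same pair is not considered close
print(np.isclose(a, b))
```

Because the default atol is tiny, a coordinate tolerance such as 0.1 has to be passed explicitly.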

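Instead of writing two nested for loops, you can use itertools.product() to iterate over all pairs. The following code does what you asked for:
import itertools
import numpy as np

array1 = np.array([[1.22, 5.64],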
                   [2.31, 7.63],
                   [4.94, 4.15]])

array2 = np.array([[1.23, 5.63],
                   [6.31, 10.63],
                   [2.32, 7.64]])

output = np.empty((0,2))
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
                                np.arange(array2.shape[0])):
    if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
        output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23  5.63]
#           [ 2.32  7.64]]

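Define a stripped-down isclose function similar to numpy.isclose (the main differences being that it does not check its inputs and does not support both relative and absolute tolerances). Its definition is not reproduced in this copy of the answer.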
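A minimal sketch of what such a helper could look like, assuming it takes a single absolute tolerance as its third positional argument (as the calls isclose(..., 0.03) in this answer suggest). The body is an assumption, not the answer author's original code, and the arrays use the (3, 1, 2) shape from the question.

```python
import numpy as np

def isclose(x, y, atol):
    # Assumed definition: plain absolute-difference test, no input checks, no rtol
    return np.abs(x - y) < atol

array1 = np.array([[[1.22, 5.64]], [[2.31, 7.63]], [[4.94, 4.15]]])
array2 = np.array([[[1.23, 5.63]], [[6.31, 10.63]], [[2.32, 7.65]]])

# Broadcasting (1, 3, 2) against (3, 1, 2) compares every pair of points
mask = isclose(array1.reshape(1, 3, 2), array2.reshape(3, 1, 2), 0.03).any(axis=0).all(axis=-1)
print(mask)
```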
First, compare every point of array2 with every point of array1, coordinate by coordinate:
In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]: 
array([[[ True,  True],
        [False, False],
        [False, False]],

       [[False, False],
        [False, False],
        [False, False]],

       [[False, False],
        [ True,  True],
        [False, False]]], dtype=bool)

Now we want all entries whose value is close to any other value (along the same dimension), and then only those for which both values of the tuple are close:
In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True,  True, False], dtype=bool)

Finally, we can use this mask to index array1:
In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]: 
array([[[ 1.22,  5.64]],

       [[ 2.31,  7.63]]])

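If you like, you can swap the any and all calls; one may be faster than the other in your case. Note that the two orders are not strictly equivalent: calling all(axis=-1) first requires both coordinates to match within the same pair of points, whereas calling any(axis=0) first lets x and y match different points.
The 3 in the reshape calls needs to be replaced with the actual length of your data.
This algorithm has the same poor (quadratic) running time as the other answer, which uses itertools.product, but at least the actual looping is done implicitly by numpy and implemented in C. This is visible in the timings: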
In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

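Here, the pares function is the code from the other answer, with the arrays already reshaped for it.
For larger arrays the difference becomes even more pronounced: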
array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))

array1_pares = array1.reshape(1000, 2)
array2_pares = array2.reshape(1000, 2)

In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

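Ultimately, this approach is limited by the available system memory. My machine (16GB RAM) could still handle arrays of length 20000, but that pushed memory usage to almost 100%. It also took about 12 seconds: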
In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
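The memory pressure comes from the broadcast intermediate: for two arrays of 20000 points, a single (20000, 20000, 2) float64 array is already 6.4 GB. One way to bound this, sketched below, is to process one array in chunks. The function name close_mask_chunked and the chunk size are hypothetical choices for illustration, and the running time stays quadratic.

```python
import numpy as np

def close_mask_chunked(array1, array2, atol, chunk=256):
    """Boolean mask over the points of array1 that have a point of array2
    within atol in every coordinate, built one chunk of array2 at a time."""
    p1 = array1.reshape(-1, 2)
    p2 = array2.reshape(-1, 2)
    mask = np.zeros(len(p1), dtype=bool)
    for start in range(0, len(p2), chunk):
        block = p2[start:start + chunk]
        # Intermediate is (len(block), len(p1), 2) instead of (len(p2), len(p1), 2)
        close = np.abs(p1[None, :, :] - block[:, None, :]) < atol
        # Both coordinates must match for the same pair, then any pair suffices
        mask |= close.all(axis=-1).any(axis=0)
    return mask

array1 = np.array([[[1.22, 5.64]], [[2.31, 7.63]], [[4.94, 4.15]]])
array2 = np.array([[[1.23, 5.63]], [[6.31, 10.63]], [[2.32, 7.65]]])
mask = close_mask_chunked(array1, array2, 0.03, chunk=2)
print(array1[mask])
```

Peak memory is then proportional to chunk * len(array1) rather than to len(array2) * len(array1).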

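As noted earlier, scaling and rounding the numbers may allow you to use intersect1d or an equivalent.
If you have only 2 columns, you can turn them into a 1d array of complex dtype.
But you may also want to keep in mind what intersect1d does: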
if not assume_unique:
    # Might be faster than unique( intersect1d( ar1, ar2 ) )?
    ar1 = unique(ar1)
    ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]

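unique has been enhanced to handle rows (the axis parameter), but intersect1d has not. In any case, it uses argsort to put similar elements next to each other and then skips the duplicates.
Notice that intersect1d concatenates the unique arrays, sorts them, and then again finds the duplicates.
I know you did not want a loop version, but to help conceptualize the problem, here is one: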
In [581]: a = np.array([(1.22, 5.64),
     ...:    (2.31, 7.63),
     ...:    (4.94, 4.15)])
     ...: 
     ...: b = np.array([(1.23, 5.63),
     ...:    (6.31, 10.63),
     ...:    (2.32, 7.65)])
     ...:    

I removed one layer of nesting from your arrays.
In [582]: c = []
In [583]: for a1 in a:
     ...:     for b1 in b:
     ...:         if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))

Or as a list comprehension:
In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]: 
[(array([1.22, 5.64]), array([1.23, 5.63])),
 (array([2.31, 7.63]), array([2.32, 7.65]))]

Complex approximation
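The code for this idea is not reproduced in this copy. A minimal sketch of the rounding-plus-complex approach mentioned earlier, assuming the hypothetical helper as_complex and an illustrative rounding to 0 decimals, could look like this:

```python
import numpy as np

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])

def as_complex(xy, decimals):
    """Round each (x, y) row and pack it into one complex number x + 1j*y."""
    r = np.round(xy, decimals)
    return r[:, 0] + 1j * r[:, 1]

# decimals=0 is only an illustrative grid size; it sets the effective tolerance
common, ia, ib = np.intersect1d(as_complex(a, 0), as_complex(b, 0),
                                return_indices=True)
matches_in_b = b[ib]
print(matches_in_b)
```

Rounding is only an approximation of a tolerance: two points that are very close but fall on opposite sides of a rounding boundary (for example x = 1.49 and x = 1.51 with decimals=0) will not be matched.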
Intersect inspired
Concatenate the arrays, sort them, take the differences, and find the small differences:
In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]: 
array([[ 4.94,  4.15],
       [ 1.23,  5.63],
       [ 1.22,  5.64],
       [ 2.31,  7.63],
       [ 2.32,  7.65],
       [ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]: 
array([[-3.71,  1.48],
       [-0.01,  0.01],
       [ 1.09,  1.99],
       [ 0.01,  0.02],
       [ 3.99,  2.98]])

In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1)  # refine with abs
Out[623]: array([False,  True, False,  True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab1[_]   # index the sorted array ab1, since the mask refers to its rows
Out[627]: 
array([[1.23, 5.63],
       [2.31, 7.63]])
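The session above can be condensed into a small function. This is a sketch under assumptions: the name adjacent_close_rows is hypothetical, the comparison uses the absolute difference (as the "refine with abs" comment suggests), and only rows that end up adjacent after sorting are compared, so it is a heuristic that can miss close pairs and can also pair two rows from the same input array.

```python
import numpy as np

def adjacent_close_rows(a, b, tol):
    """Return pairs of rows that are adjacent after lexsort and differ by
    less than tol in every column (heuristic, not an exhaustive search)."""
    ab = np.concatenate((a, b), axis=0)
    order = np.lexsort(ab.T)            # primary key: last column, then the first
    ab_sorted = ab[order]
    close = (np.abs(np.diff(ab_sorted, axis=0)) < tol).all(axis=1)
    idx = np.flatnonzero(close)
    return ab_sorted[idx], ab_sorted[idx + 1]

a = np.array([(1.22, 5.64), (2.31, 7.63), (4.94, 4.15)])
b = np.array([(1.23, 5.63), (6.31, 10.63), (2.32, 7.65)])
first, second = adjacent_close_rows(a, b, 0.1)
print(first)
print(second)
```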

You can try using pure NumPy and a custom function:
import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)

print(xDA)
print(yDA)

#Match x to y
def np_matrix(myx,myy,calp=0.2):
    Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
    Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)

    # define a caliper
    matches = {}
    dist = np.abs(Xxx - Yyy)
    for m in range(0, myx.size):
        if (np.min(dist[:, m]) <= calp) or not calp:
            matches[m] = np.argmin(dist[:, m])
    return matches


alwd_dist=0.1

xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)

shared_items = set(m1.items()) & set(m0.items())
if (int(len(shared_items))==0):
    print("No Matched Items based on given allowed distance:",alwd_dist)
else:
    print("Matched:")
    for ke in shared_items:
        print(xDA[ke[0]],yDA[ke[1]])

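There are many possible ways to define that tolerance. Since we are talking about XY coordinates, most probably we mean Euclidean distance when setting the tolerance value. So we can use scipy.spatial.cKDTree, which is efficient in both memory and performance. The implementation would look something like this: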
from scipy.spatial import cKDTree

# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
    # Get closest distances for each pt in b
    dist = cKDTree(a).query(b, k=1)[0] # k=1 selects closest one neighbor

    # Check the distances against the given tolerance value and 
    # thus filter out rows off b for the final output
    return b[dist <= tol]
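If the tolerance is meant per coordinate rather than as a Euclidean distance, the same tree query can use the maximum norm through the p argument of query. The sketch below is a variation on the function above; the name intersect_close_per_axis is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def intersect_close_per_axis(a, b, tol):
    # p=np.inf selects the maximum (Chebyshev) norm, so dist is the
    # largest absolute coordinate difference to the nearest row of a
    dist = cKDTree(a).query(b, k=1, p=np.inf)[0]
    return b[dist <= tol]

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
result = intersect_close_per_axis(a, b, tol=0.03)
print(result)
```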

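You could certainly try the pandas library. It works well with large datasets and has built-in intersection functions.

Maybe you can approximate the result by rounding to a fixed number of decimals, np.round(array1, 1), or by taking the ceiling, np.ceil(array1).

Did the posted solutions work for you?

First of all, sorry for the late reply, and thank you for all the helpful approaches. Unfortunately, I could not use any of them without modifying my initial problem. Some suggestions cost too much time, others too much memory. Still, I marked as useful all the answers I tried that generally work for this problem.

@SebastianG So how did you solve your problem in the end? Did you find something better than all the listed solutions? If so, could you share it?

So we would need 1600GB of RAM to handle 200000 pts, right? I guess we need to wait for the future to arrive.

@Divakar Well, or use a better algorithm (as shown in your answer).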

Step-by-step sample run of intersect_close:
# Input 2D arrays
In [68]: a
Out[68]: 
array([[1.22, 5.64],
       [2.31, 7.63],
       [4.94, 4.15]])

In [69]: b
Out[69]: 
array([[ 1.23,  5.63],
       [ 6.31, 10.63],
       [ 2.32,  7.65]])

# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]

In [71]: dist
Out[71]: array([0.01414214, 5.        , 0.02236068])

# Mask of distances within the given tolerance
In [72]: tol = 1

In [73]: dist <= tol
Out[73]: array([ True, False,  True])

# Finally filter out valid ones off b
In [74]: b[dist <= tol]
Out[74]: 
array([[1.23, 5.63],
       [2.32, 7.65]])
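If the matching rows of a are needed as well, query also returns the index of the nearest neighbour. The sketch below uses the distance_upper_bound argument, for which missing neighbours are reported with an infinite distance; the function name match_pairs is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_pairs(a, b, tol):
    # Rows of b without a neighbour in a within tol get dist == inf
    dist, idx = cKDTree(a).query(b, k=1, distance_upper_bound=tol)
    found = np.isfinite(dist)
    return a[idx[found]], b[found]

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
a_match, b_match = match_pairs(a, b, tol=1)
print(a_match)
print(b_match)
```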

Timings on 200,000 points:
In [20]: N = 200000
    ...: np.random.seed(0)
    ...: a = np.random.rand(N,2)
    ...: b = np.random.rand(N,2)

In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop