Python: removing duplicates from a Numpy array of vectors (within a given tolerance)
I have an Nx5 array containing N vectors of the form 'id', 'x', 'y', 'z' and 'energy'. I need to remove duplicate points (i.e. points where x, y and z all match) within a tolerance of 0.1. Ideally I could write a function to which I pass the array, the columns that need to match, and the tolerance on the match.

I can remove duplicates based on the full array using record arrays, but I only need part of the array to match. Moreover, that does not match within a tolerance.

I could laboriously iterate with a for loop in Python, but is there a better, more Numpythonic way?

You might look at scipy.spatial.KDTree. How big is N?
Added: oops, tree.query_pairs is not in scipy 0.7.1.
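The KDTree suggestion can be sketched roughly as follows in a newer scipy (the sample points and the keep-the-first-of-each-close-pair rule are mine, not from the answer; they inherit the non-transitivity issue discussed further down):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[0.00, 0.0, 0.0],
                   [0.05, 0.0, 0.0],   # within 0.1 of the first point
                   [1.00, 1.0, 1.0]])

tree = cKDTree(points)
pairs = tree.query_pairs(r=0.1)        # {(i, j)} with i < j and distance <= 0.1

# Keep the first member of each close pair, drop the second.
dup = {j for i, j in pairs}
unique = points[[k for k in range(len(points)) if k not in dup]]
print(unique)
```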
When in doubt, use brute force: split the space (here side^3) into little cells, one point per cell:
""" scatter points to little cells, 1 per cell """
from __future__ import division
import sys
import numpy as np
side = 100
npercell = 1 # 1: ~ 1/e empty
exec "\n".join( sys.argv[1:] ) # side= ...
N = side**3 * npercell
print "side: %d npercell: %d N: %d" % (side, npercell, N)
np.random.seed( 1 )
points = np.random.uniform( 0, side, size=(N,3) )
cells = np.zeros( (side,side,side), dtype=np.uint )
id = 1
for p in points.astype(int):
cells[tuple(p)] = id
id += 1
cells = cells.flatten()
# A C, an E-flat, and a G walk into a bar.
# The bartender says, "Sorry, but we don't serve minors."
nz = np.nonzero(cells)[0]
print "%d cells have points" % len(nz)
print "first few ids:", cells[nz][:10]
Haven't tested this, but if you sort the array by x, then y, then z, this should give you the list of duplicates. You then need to choose which ones to keep.
import numpy as np

def find_dup_xyz(anarray, x, y, z):
    # e.g. for data = array([id, x, y, z, energy]): x=1, y=2, z=3
    # sort by x, then y, then z so that near-duplicates end up adjacent
    sortedArray = anarray[np.lexsort((anarray[:, z], anarray[:, y], anarray[:, x]))]
    dup_xyz = []
    for i, row in enumerate(sortedArray):
        nx = 1
        while (i + nx < len(sortedArray)
               and abs(row[x] - sortedArray[i + nx][x]) < 0.1
               and abs(row[y] - sortedArray[i + nx][y]) < 0.1
               and abs(row[z] - sortedArray[i + nx][z]) < 0.1):
            dup_xyz.append(sortedArray[i + nx])  # record the later duplicate
            nx += 1
    return dup_xyz
I finally found a solution that I'm happy with; this is a slightly tidied cut-and-paste from my own code. There may still be some bugs.

Note: it still uses a 'for' loop. I could use Denis's KDTree idea above together with the rounding to get a full solution.
import numpy as np

def remove_duplicates(data, dp_tol=None, cols=None, sort_by=None):
    '''
    Removes duplicate vectors from a list of data points
    Parameters:
        data        An NxM array of N vectors of dimension M
        cols        An iterable of the columns that must match
                    in order to constitute a duplicate
                    (default: [1,2,3] for typical Klist data array)
        dp_tol      An iterable of three tolerances or a single
                    tolerance for all dimensions. Uses this to round
                    the values to the specified number of decimal
                    places before performing the removal.
                    (default: None)
        sort_by     An iterable of columns to sort by (default: [0])
    Returns:
        IxM Array   An array of I vectors (minus the duplicates)

    EXAMPLES:

    Remove a duplicate
    >>> import numpy as np
    >>> vecs1 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 0],
    ...                   [3, 0, 0, 1]])
    >>> remove_duplicates(vecs1)
    array([[1, 0, 0, 0],
           [3, 0, 0, 1]])

    Remove duplicates with a tolerance
    >>> vecs2 = np.array([[1, 0, 0, 0    ],
    ...                   [2, 0, 0, 0.001],
    ...                   [3, 0, 0, 0.02 ],
    ...                   [4, 0, 0, 1    ]])
    >>> remove_duplicates(vecs2, dp_tol=2)
    array([[ 1.  ,  0.  ,  0.  ,  0.  ],
           [ 3.  ,  0.  ,  0.  ,  0.02],
           [ 4.  ,  0.  ,  0.  ,  1.  ]])

    Remove duplicates and sort by k values
    >>> vecs3 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [3, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs3, sort_by=[3])
    array([[1, 0, 0, 0],
           [4, 0, 0, 1],
           [2, 0, 0, 2]])

    Change the columns that constitute a duplicate
    >>> vecs4 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [1, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs4, cols=[0])
    array([[1, 0, 0, 0],
           [2, 0, 0, 2],
           [4, 0, 0, 1]])
    '''
    # Deal with the parameters
    if sort_by is None:
        sort_by = [0]
    if cols is None:
        cols = [1, 2, 3]
    if dp_tol is not None:
        # test to see if already an iterable
        try:
            null = iter(dp_tol)
            tols = np.array(dp_tol)
        except TypeError:
            tols = np.ones_like(cols) * dp_tol
        # Convert to numbers of decimal places
        # Find the 'order' of the axes
    else:
        tols = None

    rnd_data = data.copy()
    # set the tolerances
    if tols is not None:
        for col, tol in zip(cols, tols):
            rnd_data[:, col] = np.around(rnd_data[:, col], decimals=tol)

    # TODO: For now, use a slow Python 'for' loop, try to find a more
    # numponic way later - see: http://stackoverflow.com/questions/2433882/
    sorted_indexes = np.lexsort(tuple([rnd_data[:, col] for col in cols]))
    rnd_data = rnd_data[sorted_indexes]
    unique_kpts = []
    for i in range(len(rnd_data)):
        if i == 0:
            unique_kpts.append(i)
        elif (rnd_data[i, cols] == rnd_data[i - 1, cols]).all():
            continue
        else:
            unique_kpts.append(i)

    rnd_data = rnd_data[unique_kpts]

    # Now sort
    sorted_indexes = np.lexsort(tuple([rnd_data[:, col] for col in sort_by]))
    rnd_data = rnd_data[sorted_indexes]
    return rnd_data

if __name__ == '__main__':
    import doctest
    doctest.testmod()
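The slow Python 'for' loop that the TODO comment mentions can probably be replaced with `np.unique(..., axis=0)` in newer NumPy (>= 1.13); a rough sketch on made-up data in the question's id/x/y/z/energy layout:

```python
import numpy as np

data = np.array([[1, 0.00, 0.00, 0.00, 5.0],
                 [2, 0.02, 0.01, 0.00, 6.0],   # near-duplicate of id 1
                 [3, 1.00, 1.00, 1.00, 7.0]])

rounded = np.round(data[:, 1:4], decimals=1)   # round x, y, z to 1 dp
_, idx = np.unique(rounded, axis=0, return_index=True)
unique = data[np.sort(idx)]                    # first occurrence of each rounded triple
print(unique[:, 0])                            # surviving ids
```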
There's an intrinsic problem with the spec you gave, which is why you're unlikely to find a ready-made solution: suppose, for clarity, that the tolerance is 0.1, y and z are always identical, and the x's are 0, 0.1, 0.2, 0.3, 0.4, ... Now what counts as a "duplicate"? By your definition, 0.1 is a "duplicate" of both 0 and 0.2, yet those two are not duplicates of each other, so the "duplicate" relation is not transitive and therefore cannot produce a partition! You'll need to define some heuristic of your own, since there is no truly "mathematically correct" solution (there can't be one: no partition exists!).

I see your point. In the problem domain I'm working in, though, I expect clustering: the mean spacing between points within a cluster is roughly the tolerance, and the mean spacing between clusters is much larger than the spacing within a cluster. The tolerance should be sized so that any point within a cluster could serve as the "canonical" point. Using a KDTree is a good idea and I may implement it later.
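The non-transitivity is easy to see in code: with a 0.11 tolerance, a greedy first-come scan keeps 0 and 0.2, while chaining the close pairs would merge all three points into one cluster (an illustrative sketch of mine, not from the thread):

```python
import numpy as np

xs = np.array([0.0, 0.1, 0.2])
tol = 0.11

# Greedy: keep a point only if it is not within tol of an already-kept point.
kept = []
for x in xs:
    if all(abs(x - k) > tol for k in kept):
        kept.append(x)
print(kept)    # 0.1 is dropped as a duplicate of 0.0, but 0.2 survives

# Chaining: 0.0~0.1 and 0.1~0.2 are close pairs, so merging pairs
# collapses all three into one cluster even though |0.2 - 0.0| > tol.
close = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(xs[i] - xs[j]) <= tol]
print(close)
```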