Python: removing duplicates from a Numpy array of vectors (within a given tolerance)
I have an Nx5 array containing N vectors of the form 'id', 'x', 'y', 'z' and 'energy'. I need to remove duplicate points (i.e. points where x, y and z all match) within a tolerance of 0.1. Ideally I could write a function to which I pass the array, the columns that need to match, and the tolerance on the match.

I can remove duplicates based on the full array using record arrays, but I only need part of the array to match. Moreover, that does not match within a tolerance.

I could laboriously iterate with a for loop in Python, but is there a better, more Numpythonic way?

You might look at scipy.spatial.KDTree. How big is N?
Added: oops, tree.query_pairs is not in scipy 0.7.1.
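The KDTree suggestion can be sketched roughly as follows in a newer scipy (the sample points and the keep-the-first-of-each-close-pair rule are mine, not from the answer; they inherit the non-transitivity issue discussed further down):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[0.00, 0.0, 0.0],
                   [0.05, 0.0, 0.0],   # within 0.1 of the first point
                   [1.00, 1.0, 1.0]])

tree = cKDTree(points)
pairs = tree.query_pairs(r=0.1)        # {(i, j)} with i < j and distance <= 0.1

# Keep the first member of each close pair, drop the second.
dup = {j for i, j in pairs}
unique = points[[k for k in range(len(points)) if k not in dup]]
print(unique)
```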
When in doubt, use brute force: split the space (here side^3) into little cells, one point per cell:
""" scatter points to little cells, 1 per cell """
from __future__ import division
import sys
import numpy as np
side = 100
npercell = 1 # 1: ~ 1/e empty
exec "\n".join( sys.argv[1:] ) # side= ...
N = side**3 * npercell
print "side: %d npercell: %d N: %d" % (side, npercell, N)
np.random.seed( 1 )
points = np.random.uniform( 0, side, size=(N,3) )
cells = np.zeros( (side,side,side), dtype=np.uint )
id = 1
for p in points.astype(int):
cells[tuple(p)] = id
id += 1
cells = cells.flatten()
# A C, an E-flat, and a G walk into a bar.
# The bartender says, "Sorry, but we don't serve minors."
nz = np.nonzero(cells)[0]
print "%d cells have points" % len(nz)
print "first few ids:", cells[nz][:10]
Haven't tested this, but if you sort the array by x, then y, then z, this should give you the list of duplicates. You then need to choose which ones to keep.
import numpy as np

def find_dup_xyz(anarray, x, y, z):
    # e.g. for data = array([id, x, y, z, energy]): x=1, y=2, z=3
    # sort by x, then y, then z so that near-duplicates end up adjacent
    sortedArray = anarray[np.lexsort((anarray[:, z], anarray[:, y], anarray[:, x]))]
    dup_xyz = []
    for i, row in enumerate(sortedArray):
        nx = 1
        while (i + nx < len(sortedArray)
               and abs(row[x] - sortedArray[i + nx][x]) < 0.1
               and abs(row[y] - sortedArray[i + nx][y]) < 0.1
               and abs(row[z] - sortedArray[i + nx][z]) < 0.1):
            dup_xyz.append(sortedArray[i + nx])  # record the later duplicate
            nx += 1
    return dup_xyz
I finally found a solution that I'm happy with; this is a slightly tidied cut-and-paste from my own code. There may still be some bugs.

Note: it still uses a 'for' loop. I could use Denis's KDTree idea above together with the rounding to get a full solution.
import numpy as np

def remove_duplicates(data, dp_tol=None, cols=None, sort_by=None):
    '''
    Removes duplicate vectors from a list of data points
    Parameters:
        data        An NxM array of N vectors of dimension M
        cols        An iterable of the columns that must match
                    in order to constitute a duplicate
                    (default: [1,2,3] for typical Klist data array)
        dp_tol      An iterable of three tolerances or a single
                    tolerance for all dimensions. Uses this to round
                    the values to the specified number of decimal
                    places before performing the removal.
                    (default: None)
        sort_by     An iterable of columns to sort by (default: [0])
    Returns:
        IxM Array   An array of I vectors (minus the duplicates)

    EXAMPLES:

    Remove a duplicate
    >>> import numpy as np
    >>> vecs1 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 0],
    ...                   [3, 0, 0, 1]])
    >>> remove_duplicates(vecs1)
    array([[1, 0, 0, 0],
           [3, 0, 0, 1]])

    Remove duplicates with a tolerance
    >>> vecs2 = np.array([[1, 0, 0, 0    ],
    ...                   [2, 0, 0, 0.001],
    ...                   [3, 0, 0, 0.02 ],
    ...                   [4, 0, 0, 1    ]])
    >>> remove_duplicates(vecs2, dp_tol=2)
    array([[ 1.  ,  0.  ,  0.  ,  0.  ],
           [ 3.  ,  0.  ,  0.  ,  0.02],
           [ 4.  ,  0.  ,  0.  ,  1.  ]])

    Remove duplicates and sort by k values
    >>> vecs3 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [3, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs3, sort_by=[3])
    array([[1, 0, 0, 0],
           [4, 0, 0, 1],
           [2, 0, 0, 2]])

    Change the columns that constitute a duplicate
    >>> vecs4 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [1, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs4, cols=[0])
    array([[1, 0, 0, 0],
           [2, 0, 0, 2],
           [4, 0, 0, 1]])
    '''
    # Deal with the parameters
    if sort_by is None:
        sort_by = [0]
    if cols is None:
        cols = [1, 2, 3]
    if dp_tol is not None:
        # test to see if already an iterable
        try:
            null = iter(dp_tol)
            tols = np.array(dp_tol)
        except TypeError:
            tols = np.ones_like(cols) * dp_tol
        # Convert to numbers of decimal places
        # Find the 'order' of the axes
    else:
        tols = None

    rnd_data = data.copy()
    # set the tolerances
    if tols is not None:
        for col, tol in zip(cols, tols):
            rnd_data[:, col] = np.around(rnd_data[:, col], decimals=tol)

    # TODO: For now, use a slow Python 'for' loop, try to find a more
    # numponic way later - see: http://stackoverflow.com/questions/2433882/
    sorted_indexes = np.lexsort(tuple([rnd_data[:, col] for col in cols]))
    rnd_data = rnd_data[sorted_indexes]
    unique_kpts = []
    for i in range(len(rnd_data)):
        if i == 0:
            unique_kpts.append(i)
        elif (rnd_data[i, cols] == rnd_data[i - 1, cols]).all():
            continue
        else:
            unique_kpts.append(i)

    rnd_data = rnd_data[unique_kpts]

    # Now sort
    sorted_indexes = np.lexsort(tuple([rnd_data[:, col] for col in sort_by]))
    rnd_data = rnd_data[sorted_indexes]
    return rnd_data

if __name__ == '__main__':
    import doctest
    doctest.testmod()
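The slow Python 'for' loop that the TODO comment mentions can probably be replaced with `np.unique(..., axis=0)` in newer NumPy (>= 1.13); a rough sketch on made-up data in the question's id/x/y/z/energy layout:

```python
import numpy as np

data = np.array([[1, 0.00, 0.00, 0.00, 5.0],
                 [2, 0.02, 0.01, 0.00, 6.0],   # near-duplicate of id 1
                 [3, 1.00, 1.00, 1.00, 7.0]])

rounded = np.round(data[:, 1:4], decimals=1)   # round x, y, z to 1 dp
_, idx = np.unique(rounded, axis=0, return_index=True)
unique = data[np.sort(idx)]                    # first occurrence of each rounded triple
print(unique[:, 0])                            # surviving ids
```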
There's an intrinsic problem with the spec you gave, which is why you're unlikely to find a ready-made solution: suppose, for clarity, that the tolerance is 0.1, y and z are always identical, and the x's are 0, 0.1, 0.2, 0.3, 0.4, ... Now what counts as a "duplicate"? By your definition, 0.1 is a "duplicate" of both 0 and 0.2, yet those two are not duplicates of each other, so the "duplicate" relation is not transitive and therefore cannot produce a partition! You'll need to define some heuristic of your own, since there is no truly "mathematically correct" solution (there can't be one: no partition exists!).

I see your point. In the problem domain I'm working in, though, I expect clustering: the mean spacing between points within a cluster is roughly the tolerance, and the mean spacing between clusters is much larger than the spacing within a cluster. The tolerance should be sized so that any point within a cluster could serve as the "canonical" point. Using a KDTree is a good idea and I may implement it later.
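The non-transitivity is easy to see in code: with a 0.11 tolerance, a greedy first-come scan keeps 0 and 0.2, while chaining the close pairs would merge all three points into one cluster (an illustrative sketch of mine, not from the thread):

```python
import numpy as np

xs = np.array([0.0, 0.1, 0.2])
tol = 0.11

# Greedy: keep a point only if it is not within tol of an already-kept point.
kept = []
for x in xs:
    if all(abs(x - k) > tol for k in kept):
        kept.append(x)
print(kept)    # 0.1 is dropped as a duplicate of 0.0, but 0.2 survives

# Chaining: 0.0~0.1 and 0.1~0.2 are close pairs, so merging pairs
# collapses all three into one cluster even though |0.2 - 0.0| > tol.
close = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(xs[i] - xs[j]) <= tol]
print(close)
```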