Python: how do I compute the distance from one dataframe to another?


Suppose I have a dataframe of points:

df1:

I also have another dataframe of points:

df2:

Is there a way to go through df1, find which point in df2 each row is closest to, and replace its label with the label of that closest point?

The result I want:

x   y    z  label
1.1 2.1 3.1   2
4.1 5.1 6.1   0
7.1 8.1 9.1   1

Thanks for reading my question.

You can just use scipy:

from scipy.spatial import distance
df1['label']=df2.label.iloc[distance.cdist(df1.iloc[:,:-1], df2.iloc[:,:-1], metric='euclidean').argmin(1)].values
df1
Out[446]: 
     x    y    z  label
0  1.1  2.1  3.1      2
1  4.1  5.1  6.1      0
2  7.1  8.1  9.1      1
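
For reference, a minimal self-contained sketch of the same idea. The df2 values below are hypothetical (the original frames were not shown in the question) and are only chosen so that the lookup reproduces the expected output above:

import pandas as pd
from scipy.spatial import distance

# hypothetical example data: df1 holds the query points (placeholder labels),
# df2 holds the labelled reference points
df1 = pd.DataFrame({'x': [1.1, 4.1, 7.1], 'y': [2.1, 5.1, 8.1],
                    'z': [3.1, 6.1, 9.1], 'label': [0, 0, 0]})
df2 = pd.DataFrame({'x': [4.0, 7.0, 1.0], 'y': [5.0, 8.0, 2.0],
                    'z': [6.0, 9.0, 3.0], 'label': [0, 1, 2]})

# pairwise Euclidean distances between every df1 point and every df2 point,
# then the index of the closest df2 row for each df1 row
closest = distance.cdist(df1.iloc[:, :-1], df2.iloc[:, :-1], metric='euclidean').argmin(axis=1)
df1['label'] = df2['label'].iloc[closest].values
print(df1)   # labels become 2, 0, 1 as in the desired result
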
Sort both frames by the 'x' column and then compare the $result arrays; this finds the closest numbers between the tables.

The ABS function returns an absolute value, so as long as df2 contains integers it is a workable solution.

Hope this helps.
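
One way to read that suggestion, as a brute-force sketch rather than the answerer's exact code: for each row of df1, take the df2 row with the smallest summed absolute coordinate difference (Manhattan distance). The column names and the nearest_labels helper are assumptions for illustration:

import numpy as np
import pandas as pd

def nearest_labels(df1, df2):
    # assumed columns: x, y, z coordinates, plus a label column on df2
    pts1 = df1[['x', 'y', 'z']].to_numpy()
    pts2 = df2[['x', 'y', 'z']].to_numpy()
    # (len(df1), len(df2)) matrix of summed absolute differences
    d = np.abs(pts1[:, None, :] - pts2[None, :, :]).sum(axis=2)
    # label of the closest df2 row for each df1 row
    return df2['label'].to_numpy()[d.argmin(axis=1)]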

Here is a version that uses a k-d tree, which should be much faster for large datasets:

import numpy as np
import pandas as pd
from  sklearn.neighbors import KDTree
np.random.seed(0)
#since you already have df1 and df2, convert them to arrays here with
#X = df1[['x','y','z']].to_numpy()
#Y = df2[['x','y','z']].to_numpy()
X = np.random.random((10, 3))  # 10 points in 3 dimensions
Y = np.random.random((10, 3))
tree = KDTree(Y, leaf_size=2)  


#find the closest point in Y to each point in X
#note that you can find as many as k nearest neighbors by this method,
#though your case only calls for k=1
dist, ind = tree.query(X, k=1) 

df1=pd.DataFrame(X, columns=['x','y','z']) 

#set the label of each X point to the index of its nearest neighbor in Y
#tree.query returns ind with shape (n, k); with k=1 take the single column
df1['label'] = ind[:, 0]
print(df1)

df1:
          x         y         z
0  0.548814  0.715189  0.602763
1  0.544883  0.423655  0.645894
2  0.437587  0.891773  0.963663
3  0.383442  0.791725  0.528895
4  0.568045  0.925597  0.071036
5  0.087129  0.020218  0.832620
6  0.778157  0.870012  0.978618
7  0.799159  0.461479  0.780529
8  0.118274  0.639921  0.143353
9  0.944669  0.521848  0.414662
df2:
          x         y         z
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.616934  0.943748
3  0.681820  0.359508  0.437032
4  0.697631  0.060225  0.666767
5  0.670638  0.210383  0.128926
6  0.315428  0.363711  0.570197
7  0.438602  0.988374  0.102045
8  0.208877  0.161310  0.653108
9  0.253292  0.466311  0.244426

Out:
          x         y         z  label
0  0.548814  0.715189  0.602763      0
1  0.544883  0.423655  0.645894      6
2  0.437587  0.891773  0.963663      2
3  0.383442  0.791725  0.528895      0
4  0.568045  0.925597  0.071036      7
5  0.087129  0.020218  0.832620      8
6  0.778157  0.870012  0.978618      2
7  0.799159  0.461479  0.780529      2
8  0.118274  0.639921  0.143353      9
9  0.944669  0.521848  0.414662      3
Here is a plot you can use to check the result: the blue points are the X points and the orange points are the Y points.

Here is the code used to make the plot (matplotlib version 3.0.2):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:,0], X[:,1], X[:,2])
ax.scatter(Y[:,0], Y[:,1], Y[:,2])
for i in range(len(X)): #plot each X point + its index as text above it
    ax.text(X[i,0], X[i,1], X[i,2], '%s' % (str(i)), size=20, zorder=1, color='blue')
for i in range(len(Y)): #plot each Y point + its index as text above it
    ax.text(Y[i,0], Y[i,1], Y[i,2], '%s' % (str(i)), size=20, zorder=1, color='orange')
plt.show()

My first answer addressed the question as asked, but the OP wanted a general solution for an arbitrary number of dimensions, not just three.

import numpy as np
import pandas as pd
from  sklearn.neighbors import KDTree


np.random.seed(0)
#since you already have df1 and df2, convert them to arrays here with
#X = df1[['x','y','z']].to_numpy()
#Y = df2[['x','y','z']].to_numpy()
n=11    #n=number of dimensions in your sample
X = np.random.random((10, n))  # 10 points in n dimensions
Y = np.random.random((10, n))
tree = KDTree(Y, leaf_size=2)  

#find the closest point in Y to each point in X
dist, ind = tree.query(X, k=1)

df1 = pd.DataFrame(X)
#set the label of each X point to the index of its nearest neighbor in Y
df1['label'] = ind[:, 0]

The result you want is now in df1, but you cannot easily plot it or interpret it visually in that many dimensions; its correctness rests on the 3-d version demonstrated above.
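
If your real df2 carries its own label column (rather than the row index standing in for the label, as in these synthetic examples), the final lookup step would look roughly like this sketch; it assumes Y was built from df2, so the row order matches:

#map each nearest-neighbor index back to df2's label column
df1['label'] = df2['label'].to_numpy()[ind[:, 0]]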

Define "closest"? What is the equation, the sum of the differences along a row?

I mean a simple distance formula: for instance the point (1,2,3) would be the closest point to (1.1, 2.1, 3.1). BVOM, does that answer the question?

This can get computationally expensive; if you are doing this over two large arrays you may want to implement a k-d tree.

If my points are 11-dimensional, is my case still k=1? I am getting a "query data dimension must match training data dimension" error.

Can you post a small example plus the error? If you only want the nearest neighbor, k should be 1; being 11-d does not change that. k is simply how many neighbors you want returned for each point. If you are passing a single point to the query, you just need to reshape it: say you only want the nearest neighbor of the first point in X, then reshape it as in dist, ind = tree.query(X[0,:].reshape(1,-1), k=1). The k-d tree should handle any number of Euclidean dimensions. I just tested my code in an 11-d space and it works, but you need to drop everything after the line dist, ind = tree.query(X, k=1), because the pandas and plotting parts were written assuming a 3-d space. Obviously you cannot plot an 11-d space in an easily readable way, although you could plot cross sections.

When you say to drop everything after that line, how can I still show the new labels? If I remove everything after it, I only have the tree and nothing else. Sorry, I am just trying to understand.