Density clustering around a separate point - Python


My goal is to cluster x-y points based on how close they are to one another; specifically, to group points that lie near each other. I would also like to use a separate reference point to guide the clustering.

Note: I have multiple sets of data that need to be clustered separately. For example, in the data below, each unique value in Item denotes a separate set of data. I can have many such unique sets, and they vary in sparsity. As a result, any technique that requires a predetermined number of clusters is unrealistic, because I would have to manually inspect the fit and tune the appropriate parameters every time.

So far, the best approach has therefore been some form of density-based clustering (DBSCAN, OPTICS).
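
A minimal sketch of how one of these could be run on each Item group separately, so that no per-group cluster count has to be chosen by hand (the helper name cluster_per_item is illustrative; the DBSCAN parameters match the commented-out call in the snippet further below):

import pandas as pd
from sklearn.cluster import DBSCAN

def cluster_per_item(df, eps=7.5, min_samples=3):
    """Run DBSCAN independently on each Item group and return one label per row."""
    labels = pd.Series(index=df.index, dtype=float)
    for item, group in df.groupby('Item'):
        # Cluster only this group's x-y points; -1 marks noise as usual for DBSCAN.
        labels.loc[group.index] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(group[['x', 'y']].values)
    return labels.astype(int)

# Usage with the DataFrame defined in the snippet below:
# df['cluster'] = cluster_per_item(df)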

However, while I want to cluster the points that sit tightly together, I would also like some cutoff so that the intended cluster stays roughly spherical. On the other hand, I don't want to reduce the reachable area too much, because then I miss points that are close to the reference point and to the core points: a small gap ends up discarding points I would want to include.

The dilemma is illustrated below. Item 1 shows how the reachability needs to be kept low to ensure that the points clustered around the reference point stay spherical, while Item 2 shows how the reachable area needs to be larger so that points within the dense region are included.

I'm hoping there is a parameter I can tune, or a separate feature I can incorporate, rather than forcing a hard cutoff. Because the density of the region around the reference point can vary, I'm reluctant to simply exclude every point beyond a fixed radius.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
import seaborn as sns
from sklearn.cluster import OPTICS

fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)

df = pd.DataFrame({   
    'Item' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],                                
    'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,10.0,-2.0,2.0,5.0,7.5,15.0,0.0,-22.0,-20.0,-20.0,-6.5,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0,-2.0,0.0,3.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-12.0,20.5,6.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
    'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-7.0,-0.5,-10.5,-7.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0,4.0,-2.0,0.0,0.0,2.5,2.0,-1.5,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,6.0,-20.0,2.0,-17.5,-15,19.0,20.0],     
    'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
    'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],           
   })

# not spherical
df = df[df['Item'] == 1]

# spherical but reachable area too small
#df = df[df['Item'] == 2]

df['distance'] = np.sqrt((df['X_Ref'] - df['x'])**2 + (df['Y_Ref'] - df['y'])**2)

Y_sklearn = df[['x','y']].values

ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)

#clusterer = DBSCAN(eps = 7.5, min_samples = 3)
#labels_clusters = clusterer.fit_predict(Y_sklearn)

clusterer = OPTICS(min_samples = 2, xi = 0.25, min_cluster_size = 0.25, max_eps = 5)
labels_clusters = clusterer.fit_predict(Y_sklearn)

#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')

sns.scatterplot(data = df,
            x = 'x',
            y = 'y',
            hue = 'cluster',
            ax = ax,
            legend = 'full',                
            )
Item 1: points to the right of the radius should be excluded from the core points

Item 2: points within the radius should be included in the core points

I believe we can reframe the problem. I'm not sure a clustering method is the best approach here.

Clustering using the distance

For df: (plot)

For df1: (plot)
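
The idea behind this first approach (the full script is the first one reproduced near the end of the post) is to reduce each point to a single feature, its distance to the reference point, and split that one-dimensional feature into two groups with KMeans. A condensed sketch; the helper name cluster_by_reference_distance is illustrative:

import numpy as np
from sklearn.cluster import KMeans

def cluster_by_reference_distance(df, n_clusters=2):
    """Cluster points on their distance to the reference point alone."""
    dist = np.sqrt((df['X_Ref'] - df['x'])**2 + (df['Y_Ref'] - df['y'])**2)
    clusterer = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
    # KMeans on a single feature splits near vs. far points around the reference.
    return clusterer.fit_predict(dist.values.reshape(-1, 1))

# Usage: df['cluster'] = cluster_by_reference_distance(df)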

Using the marginal increase in area

As stated before, I believe the problem can be restated in terms of marginal area: each point we add increases the cost by a different amount.

In other words, apply the elbow method to each point as it is added.

For the area calculation, I will use the square of the distance as a proxy.
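
To make the marginal cost concrete, a small sketch (the helper name marginal_area_increase is illustrative, and it assumes the distance column computed in the scripts at the end of the post): sorting the points by distance and squaring gives a proxy for the area needed to reach each point, and the difference between consecutive values is the marginal increase each added point causes.

import numpy as np

def marginal_area_increase(distances):
    """Sorted squared distances (area proxy) and the extra 'area' each added point contributes."""
    d2 = np.sort(np.asarray(distances)) ** 2
    marginal = np.diff(d2, prepend=0.0)  # jump caused by each successive point
    return d2, marginal

# Usage: d2, marginal = marginal_area_increase(df['distance'])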

Code: (the full script for this approach is the second one reproduced at the end of the post)

For df: scree plot

From the scree plot you can see where the number of points starts to become too large. I would say selecting 10 points would work well; the choice is based on the elbow method.

Final plot:

For df1: scree plot:

Following the elbow-method criterion, 13 points would probably be the optimum.

Final plot:
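
The elbow readings above are done by eye. Since the question stresses that every set cannot be inspected manually, one possible way to automate the cutoff is to cut just before the largest jump in the sorted squared distances (the helper name pick_cutoff_by_largest_jump is illustrative):

import numpy as np

def pick_cutoff_by_largest_jump(distances):
    """Number of closest points to keep: everything before the largest jump in sorted squared distance."""
    d2 = np.sort(np.asarray(distances)) ** 2
    jumps = np.diff(d2)                   # marginal increase between consecutive points
    return int(np.argmax(jumps)) + 1      # keep all points before the biggest jump

# Usage with the scripts below: selected_top = pick_cutoff_by_largest_jump(df['distance'])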

Comments:

"Let me see if I follow: a) you need to pick the cluster around the center point, b) it should be circular, and c) the split is points inside the circle vs. outside the circle. Right?"

"Basically. To be clear, though, with a clustering approach a preset number of groups does not apply, because the distribution of the points varies, so I need something that handles that. The size of the radius will also differ. I've kept the DataFrames separate for clarity. Essentially I have many sets of points with different distributions. I'll clarify this in the question, but I can't manually inspect every instance to determine the appropriate number of clusters/groups. If I could check each one by hand to determine the correct number, I would probably use this method or a GMM."

"Glad this helps :) If you find a better approach, please post it."
""""
https://stackoverflow.com/questions/66099958/density-clustering-around-a-separate-point-python
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
import seaborn as sns
from sklearn.cluster import OPTICS
from sklearn.cluster import MiniBatchKMeans, KMeans

# not spherical
df = pd.DataFrame({
    'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,12.0,-2.0,2.0,8.0,8.5,15.0,-20.0,-22.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
    'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0],
    'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
       })

# spherical but reachable area too small
df1 = pd.DataFrame({
    'x' : [-2.0,0.0,2.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-15.0,20.5,8.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
    'y' : [4.0,-2.0,0.0,0.0,2.5,2.0,-2.0,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,5.0,-20.0,2.0,-17.5,-15,19.0,20.0],
    'X_Ref' : [-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
    'Y_Ref' : [-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],
   })

#Distance calculations
df['distance'] = np.sqrt((df['X_Ref'] - df['x'])**2 + (df['Y_Ref'] - df['y'])**2)
def distance_func(df):
    return np.sqrt((df['X_Ref'] - df['x']) ** 2 + (df['Y_Ref'] - df['y']) ** 2)
df1['distance'] = distance_func(df1)

# Toggle between the two datasets: keep this line to plot df1, comment it out to plot df.
df = df1.copy()
Y_sklearn = df['distance'].values.reshape(-1, 1)
fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)
ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)
clusterer = KMeans(init='k-means++', n_clusters=2, n_init=10)
labels_clusters = clusterer.fit_predict(Y_sklearn)

#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')

sns.scatterplot(data = df,
            x = 'x',
            y = 'y',
            hue = 'cluster',
            ax = ax,
            legend = 'full',
            )
""""
https://stackoverflow.com/questions/66099958/density-clustering-around-a-separate-point-python
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
import seaborn as sns
from sklearn.cluster import OPTICS
from sklearn.cluster import MiniBatchKMeans, KMeans



# not spherical
df = pd.DataFrame({
    'x' : [-4.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,12.0,-2.0,2.0,8.0,8.5,15.0,-20.0,-22.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
    'y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,0.0,0.0,-2.0,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,3.0,-20.0,2.0,-17.5,-15,19.0,20.0],
    'X_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    'Y_Ref' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
       })

# spherical but reachable area too small
df1 = pd.DataFrame({
    'x' : [-2.0,0.0,2.0,-3.0,-7.0,-7.5,-9.0,-4.0,1.5,-1.0,-5.0,-4.5,-3.7,15.0,-20.0,-22.0,-20.0,-20.0,-15.0,20.5,8.0,20.0,-20.0,-15.0,20.0,-15.0,-10.0],
    'y' : [4.0,-2.0,0.0,0.0,2.5,2.0,-2.0,5.0,0.0,3.5,2.0,-5.5,-6.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,5.0,-20.0,2.0,-17.5,-15,19.0,20.0],
    'X_Ref' : [-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0],
    'Y_Ref' : [-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],
   })

df['distance'] = np.sqrt((df['X_Ref'] - df['x'])**2 + (df['Y_Ref'] - df['y'])**2)

def distance_func(df):
    return np.sqrt((df['X_Ref'] - df['x']) ** 2 + (df['Y_Ref'] - df['y']) ** 2)

df1['distance'] = distance_func(df1)

# To switch from one dataset to another, uncomment the line below.
#df = df1.copy()
df['distance_2'] = df['distance']**2


df.sort_values('distance',inplace=True)
#pd.DataFrame(df['marginal_change'].values).plot()
aux = pd.DataFrame(df['distance_2'].values, columns=['distance ** 2'])
aux.plot()


fig, ax = plt.subplots(figsize = (6,6))
ax.grid(False)
ax.scatter(df['x'], df['y'], marker = 'o', s = 5)
ax.scatter(df['X_Ref'], df['Y_Ref'], c = 'w', edgecolor = 'k', marker = 'o', s = 7.5, zorder = 2)


selected_top=10
labels_clusters = np.zeros(df.shape[0])
labels_clusters[0:selected_top] =1

#Add cluster labels as a new column to original DataFrame.
df['cluster'] = labels_clusters
df['cluster'] = df['cluster'].astype('category')

sns.scatterplot(data = df,
            x = 'x',
            y = 'y',
            hue = 'cluster',
            ax = ax,
            legend = 'full',
            )
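
To reproduce the df1 result with the script above, a possible adjustment is to switch the dataset toggle and use the elbow reading of 13 instead of 10:

# In the script above, change these two lines:
df = df1.copy()      # use the second dataset instead of leaving the toggle commented out
selected_top = 13    # elbow reading for df1 (10 was the reading for df)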