How do I create balanced k-means geospatial clusters in Python?


I have 9,000 points in the US; the points are accounts with a variety of string and numeric columns/attributes. I am trying to split these points/accounts evenly into fair groupings that are spatially grouped and also weighted, in a gravity sense, by number of employees, which is one of the columns/attributes. I used scikit-learn K-means clustering to do the grouping, and it seems to work well, but I noticed that the groupings are not equal in size. Some groups have about 600 members and some have about 70. That is somewhat logical, since some areas have more data. The problem is that I need the groups to be more equal. Here is the code I used:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=30, max_iter=1000, init='k-means++')

# columns 1:3 hold latitude/longitude; column 3 holds the employee counts
lat_long = dftobeclustered[dftobeclustered.columns[1:3]]
_employees = dftobeclustered[dftobeclustered.columns[3]]

# fit on lat/long, weighting each point by its employee count
weighted_kmeans_clusters = kmeans.fit(lat_long, sample_weight=_employees)
dftobeclustered['cluster_label'] = kmeans.predict(lat_long, sample_weight=_employees)

centers = kmeans.cluster_centers_

labels = dftobeclustered['cluster_label']

Is it possible to split the k-means clusters into more equal groups? I think the core problem is that it breaks out low-population areas like Montana or Hawaii into their own groups, when I actually need it to fold those areas into larger groupings. But I don't know.
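
For reference, the imbalance described above can be checked directly from the cluster_label column created by the code, for example:

# count how many accounts were assigned to each k-means cluster
print(dftobeclustered['cluster_label'].value_counts().sort_index())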

K-means is not written that way. Observations are assigned to clusters based on their actual measured distance from the centroids.

If you try to force the number of members in each cluster, you completely undo that distance-measurement component, especially when you are talking geographically with lat/lon.

You may need to look at another method of subsetting your observations, or rethink the requirement that the clusters be of equivalent size.

Honestly, most of the time clustering on geographic distance relates directly to how similar the observations are in other respects, such as house style, demographics, or neighborhood income, and how that translates into a zip code or the types of trees in an area. Those things do not respect our need for them to be equally sized groups.

Clusters based on qualities other than geography are much more likely to even out in size than straight lat/lon clusters, because lat/lon clusters are sorted by distance... there is no way around that.

So areas with dense observations will have many more members than areas with sparse observations, and the distance between MT and HI will always be greater than the distance between MT and NYC, so they will not cluster together geographically by distance.

I understand you want equal groupings... but is the geographic piece necessary? Given that MT and HI would end up together, the geographic label carries little meaning. It would be better to cluster on all of the non-geographic numeric values and create contextually similar observations.

Otherwise, you can use business rules to dissect the observations, I mean something like: if var_x > 7 and var_y is below some cutoff, then group A, and so on. A rough sketch of both ideas follows.
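
Here is a rough illustration of those two suggestions. The column names num_employees, revenue, var_x, and var_y are placeholders, not columns known to exist in the asker's data, and the thresholds are made up; one sketch clusters on scaled non-geographic features, the other buckets rows with explicit rules:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# hypothetical non-geographic numeric columns; substitute the real account attributes
feature_cols = ['num_employees', 'revenue']
X = StandardScaler().fit_transform(dftobeclustered[feature_cols])

# cluster on contextual similarity instead of lat/lon
dftobeclustered['context_cluster'] = KMeans(n_clusters=30, n_init=10).fit_predict(X)

# or: explicit business rules (illustrative thresholds only)
conditions = [
    (dftobeclustered['var_x'] > 7) & (dftobeclustered['var_y'] < 10),
    (dftobeclustered['var_x'] <= 7),
]
dftobeclustered['rule_bucket'] = np.select(conditions, ['group_a', 'group_b'], default='group_c')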


Try DBSCAN. See the sample code below.

# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint


# define the number of kilometers in one radian
kms_per_radian = 6371.0088


# load the data set
df = pd.read_csv('C:\\your_path\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()


# how many rows are in this data set?
len(df)


# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)

 

# represent points consistently as (lat, lon)
# coords = df.as_matrix(columns=['lat', 'lon'])
df_coords = df[['lat', 'lon']]
# coords = df.to_numpy(df_coords)

# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian


start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters
num_clusters = len(set(cluster_labels))


# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))

# for each cluster label and color, plot only that cluster's points
for cluster_label, color in zip(unique_labels, colors):

    size = 150
    if cluster_label == -1: #make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30

    # select and plot only the points that match the current cluster label
    cluster_points = df_coords[cluster_labels == cluster_label]
    ax.scatter(x=cluster_points['lon'], y=cluster_points['lat'], c=[color], edgecolor='k', s=size, alpha=0.5)

ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
Result:

Number of clusters: 138


# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
0                  lat        lon
1587  37.921659  22...
1                  lat        lon
1658  37.933609  23...
2                  lat        lon
1607  37.966766  23...
3                  lat        lon
1586  38.149019  22...
4                  lat        lon
1584  38.374766  21...
                       
133              lat        lon
662  50.37369  18.889205
134               lat        lon
561  50.448704  19.0...
135               lat        lon
661  50.462271  19.0...
136               lat        lon
559  50.489304  19.0...
137             lat       lon
1  51.474005 -0.450999
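
Tying this back to the original question about group sizes, the size of each DBSCAN cluster can be read straight off the labels. Also, although the code above never uses its great_circle and MultiPoint imports, they are commonly used to reduce each cluster to its centermost point; a sketch along those lines, assuming the clusters series built above and skipping any empty entries:

# how many points ended up in each DBSCAN cluster (-1, if present, is noise)
print(pd.Series(cluster_labels).value_counts().sort_index())

# reduce each non-empty cluster to the point closest to its centroid,
# using the great_circle and MultiPoint imports from above
def get_centermost_point(cluster_df):
    points = [tuple(p) for p in cluster_df[['lat', 'lon']].values]
    centroid = (MultiPoint(points).centroid.x, MultiPoint(points).centroid.y)
    return min(points, key=lambda point: great_circle(point, centroid).m)

centermost_points = clusters[clusters.map(len) > 0].map(get_centermost_point)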
Data source:

Related resources:

