Python: How can I find the number of rows in each cluster with DBSCAN?

python scikit-learn

The CSV data looks like this:

device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_1
A0001,2020-08-05 05:10:05+00:00,23.140366,114.18685,0.0,,0,202008
A0001,2020-08-05 05:10:33+00:00,22.994716,114.2998,0.0,,0,202008
A0001,2020-08-05 05:20:55+00:00,22.994716,114.2998,0.0,,3.8,202008
A0001,2020-08-05 05:24:02+00:00,22.994916,114.299683,0.0,,2.1,202008
A0001,2020-08-05 05:24:30+00:00,22.99545,114.2998,0.0,,6.5,202008
A0001,2020-08-05 05:29:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:34:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:39:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:53+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:45:40+00:00,22.995433,114.299766,0.0,,5.8,202008

I use the latitude and longitude columns from the CSV to generate a DBSCAN cluster plot, with each cluster drawn in a different color.

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd

def draw_with_dbscan(para_csv_path_name,para_csv_name,para_save_path):
    df = pd.read_csv(para_csv_path_name, encoding='utf-8', parse_dates=[1], low_memory=False)
    X = df[['latitude', 'longitude']]
    X = X.drop_duplicates()
    kms_per_rad = 6371.0088  # mean radius of the earth in km
    # eps is the maximum distance between two samples for one to be considered
    # in the neighborhood of the other (sklearn default is 0.5). With
    # metric='haversine' it must be given in radians, so 1.5 km is converted here.
    epsilon = 1.5 / kms_per_rad
    dbsc = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(X)))
    fac_cluster_labels = dbsc.labels_
    # get the number of clusters
    num_clusters = len(set(dbsc.labels_))
    # turn the clusters into a pandas series,where each element is a cluster of points
    dbsc_clusters = pd.Series([X[fac_cluster_labels == n] for n in range(num_clusters)])
    # get centroid of each cluster
    fac_centroids = dbsc_clusters.map(get_centroid)
    # unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
    cent_lats, cent_lons = zip(*fac_centroids)
    # from these lats/lons create a new df of one representative point for each cluster
    centroids_pd = pd.DataFrame({'longitude': cent_lons, 'latitude': cent_lats})
    # Plot the facility clusters and cluster centroids
    fig, ax = plt.subplots(figsize=[20, 10])
    facility_scatter = ax.scatter(X['longitude'], X['latitude'], c=fac_cluster_labels,
                                  edgecolor='None', alpha=0.7, s=120)
    centroid_scatter = ax.scatter(centroids_pd['longitude'], centroids_pd['latitude'], marker='x', linewidths=2,
                                  c='k', s=50)
    ax.set_title('Facility Clusters & Facility Centroid', fontsize=30)
    ax.set_xlabel('Longitude', fontsize=24)
    ax.set_ylabel('Latitude', fontsize=24)
    ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'], loc='upper right',
              fontsize=20)
    # plt.show()
    plt.savefig(para_save_path + para_csv_name.split('.')[0] + '.png')
    plt.close()


def get_centroid(cluster):
    """calculate the centroid of a cluster of geographic coordinate points
    Args:
      cluster coordinates, nx2 array-like (array, list of lists, etc)
      n is the number of points(latitude, longitude)in the cluster.
    Return:
      geometry centroid of the cluster

    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid



if __name__ == '__main__':
    csvlName=r'E:/mydata/test.csv'
    item='test.csv'
    abnormal_dbscan_device_img_dir=r'E:/result/'
    draw_with_dbscan(csvlName, item, abnormal_dbscan_device_img_dir)

The generated image looks like this:

But how can I use DBSCAN to find out how many rows of latitude and longitude data are in each cluster?

You may want to try:

values = np.unique(fac_cluster_labels,return_counts=True)
{k:v for k,v in zip(*values)}
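As a minimal sketch of the same idea, the snippet below wraps the `np.unique` call in a small helper. The helper name `cluster_sizes` and the `example_labels` array are hypothetical stand-ins for `dbsc.labels_`; they do not come from the question. Note that in the question's function `X` has had duplicates dropped, so these counts refer to distinct coordinate pairs rather than CSV rows. DBSCAN marks noise with the label -1, which would show up as its own key; with `min_samples=1` every point is a core point, so no noise label is expected.

```python
import numpy as np

def cluster_sizes(labels):
    """Return {cluster_label: number_of_points} for a 1-D array of labels."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    # Convert NumPy scalars to plain ints so the dict prints cleanly.
    return {int(k): int(v) for k, v in zip(values, counts)}

# Hypothetical labels standing in for dbsc.labels_
example_labels = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2])
sizes = cluster_sizes(example_labels)
print(sizes)  # {0: 3, 1: 2, 2: 4}
```

Inside `draw_with_dbscan`, the equivalent call would presumably be `cluster_sizes(fac_cluster_labels)`.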

Can you include the expected output that relates to your title?

I have modified my code.

It is still unclear what your problem is, but you can try:

values, counts = np.unique(fac_cluster_labels, return_counts=True); {k: v for k, v in zip(values, counts)}

For example, the CSV has 100 rows of latitude/longitude data divided into 4 clusters: the purple cluster has 10 rows, the blue cluster 20, the yellow cluster 30, and the green cluster 40. I want to know how many rows of data each cluster has.

Did you try the code I suggested above?

Yes, that is what I want.
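One caveat worth illustrating: the function in the question calls `drop_duplicates()` before clustering, so the label array has one entry per distinct coordinate pair, not per CSV row. If the goal is a count of original rows per cluster, one possible approach (an assumption, not something stated in the thread) is to attach the labels to the de-duplicated coordinates and merge them back onto the full frame. The helper name `rows_per_cluster`, the sample frame, and the label values below are all hypothetical.

```python
import pandas as pd

def rows_per_cluster(df, unique_coords, labels):
    """Count original rows per cluster, given labels for de-duplicated coordinates."""
    labelled = unique_coords.copy()
    # Labels are aligned by position with the rows of unique_coords.
    labelled['cluster'] = list(labels)
    # Every original row picks up the label of its (latitude, longitude) pair.
    merged = df.merge(labelled, on=['latitude', 'longitude'], how='left')
    return merged['cluster'].value_counts().sort_index()

# Small frame modelled on the question's sample data
df = pd.DataFrame({
    'latitude':  [23.140366, 22.994716, 22.994716, 22.995433, 22.995433, 22.995433],
    'longitude': [114.18685, 114.2998, 114.2998, 114.299766, 114.299766, 114.299766],
})
unique_coords = df[['latitude', 'longitude']].drop_duplicates()
labels = [0, 1, 1]  # hypothetical stand-in for dbsc.labels_ (3 distinct points)

counts = rows_per_cluster(df, unique_coords, labels)
print(counts.to_dict())  # {0: 1, 1: 5}
```

The merge relies on exact equality of the coordinate values, which holds here because both frames come from the same source columns.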