Python: How do I find the number of rows in each cluster with DBSCAN?
The CSV data looks like this:
device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_1
A0001,2020-08-05 05:10:05+00:00,23.140366,114.18685,0.0,,0,202008
A0001,2020-08-05 05:10:33+00:00,22.994716,114.2998,0.0,,0,202008
A0001,2020-08-05 05:20:55+00:00,22.994716,114.2998,0.0,,3.8,202008
A0001,2020-08-05 05:24:02+00:00,22.994916,114.299683,0.0,,2.1,202008
A0001,2020-08-05 05:24:30+00:00,22.99545,114.2998,0.0,,6.5,202008
A0001,2020-08-05 05:29:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:34:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:39:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:53+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:45:40+00:00,22.995433,114.299766,0.0,,5.8,202008
I use the latitude and longitude data in the CSV to generate a DBSCAN cluster image, with each cluster drawn in a different color:
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd
def draw_with_dbscan(para_csv_path_name, para_csv_name, para_save_path):
    df = pd.read_csv(para_csv_path_name, encoding='utf-8', parse_dates=[1], low_memory=False)
    X = df[['latitude', 'longitude']]
    X = X.drop_duplicates()
    kms_per_rad = 6371.0088  # mean radius of the earth in km
    # eps: the maximum distance between two samples for one to be considered as in the
    # neighborhood of the other. This is not a maximum bound on the distances of points
    # within a cluster. This is the most important DBSCAN parameter to choose appropriately
    # for your data set and distance function. default=0.5
    # Here 1.5 km is converted to radians because the haversine metric works in radians.
    epsilon = 1.5 / kms_per_rad
    dbsc = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(X))
    fac_cluster_labels = dbsc.labels_
    # get the number of clusters
    num_clusters = len(set(dbsc.labels_))
    # turn the clusters into a pandas Series, where each element is a cluster of points
    dbsc_clusters = pd.Series([X[fac_cluster_labels == n] for n in range(num_clusters)])
    # get the centroid of each cluster
    fac_centroids = dbsc_clusters.map(get_centroid)
    # unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
    cent_lats, cent_lons = zip(*fac_centroids)
    # from these lats/lons create a new df of one representative point for each cluster
    centroids_pd = pd.DataFrame({'longitude': cent_lons, 'latitude': cent_lats})
    # plot the facility clusters and cluster centroids
    fig, ax = plt.subplots(figsize=[20, 10])
    facility_scatter = ax.scatter(X['longitude'], X['latitude'], c=fac_cluster_labels,
                                  edgecolor='None', alpha=0.7, s=120)
    centroid_scatter = ax.scatter(centroids_pd['longitude'], centroids_pd['latitude'], marker='x', linewidths=2,
                                  c='k', s=50)
    ax.set_title('Facility Clusters & Facility Centroid', fontsize=30)
    ax.set_xlabel('Longitude', fontsize=24)
    ax.set_ylabel('Latitude', fontsize=24)
    ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'], loc='upper right',
              fontsize=20)
    # plt.show()
    plt.savefig(para_save_path + para_csv_name.split('.')[0] + '.png')
    plt.close()

def get_centroid(cluster):
    """Calculate the centroid of a cluster of geographic coordinate points.

    Args:
        cluster: coordinates, n x 2 array-like (array, list of lists, etc.);
            n is the number of points (latitude, longitude) in the cluster.
    Return:
        geometric centroid of the cluster
    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid

if __name__ == '__main__':
    csvlName = r'E:/mydata/test.csv'
    item = 'test.csv'
    abnormal_dbscan_device_img_dir = r'E:/result/'
    draw_with_dbscan(csvlName, item, abnormal_dbscan_device_img_dir)
The generated image looks like this:
But how can I use DBSCAN to find out how many rows of latitude and longitude data are in each cluster?

You may want to try:
values = np.unique(fac_cluster_labels,return_counts=True)
{k:v for k,v in zip(*values)}
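
To make the suggestion above easier to try in isolation, here is a minimal sketch that runs without the CSV. The `labels` array is a made-up stand-in for `dbsc.labels_` (it is not output from the question's data), and the names `cluster_sizes` and `size_series` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical label array standing in for dbsc.labels_ (assumed values)
labels = np.array([0, 0, 1, 1, 1, 2, 0, 1])

# np.unique returns the distinct labels and how often each one occurs
values, counts = np.unique(labels, return_counts=True)
cluster_sizes = {int(k): int(v) for k, v in zip(values, counts)}
print(cluster_sizes)  # {0: 3, 1: 4, 2: 1}

# The same counts as a pandas Series indexed by cluster label
size_series = pd.Series(labels).value_counts().sort_index()
print(size_series)
```

scikit-learn's DBSCAN marks noise points with the label -1, so a key of -1 would be the noise count rather than a cluster. With `min_samples=1`, as in the question, every point is a core point, so no -1 label should appear.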
Can you describe the expected output in relation to your title?
I have modified my code.
It is still not clear what your problem is, but you can try: values, counts = np.unique(fac_cluster_labels, return_counts=True); {k: v for k, v in zip(values, counts)}
For example, the CSV has 100 rows of latitude and longitude data divided into 4 clusters: the purple cluster has 10 rows, the blue cluster has 20 rows, the yellow cluster has 30 rows, and the green cluster has 40 rows. I want to know how many rows of data each cluster has.
Have you tried the code I suggested above?
Yes, this is what I wanted.
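
One caveat that may matter for the 100-row example: the question's code calls `drop_duplicates()` before clustering, so counting `fac_cluster_labels` gives the number of distinct coordinate pairs per cluster, not the number of CSV rows. If CSV rows are what is wanted, one possible approach is to merge the labels back onto the original frame. The sketch below assumes that; `rows_per_cluster` is a hypothetical helper, and the labels are hand-assigned rather than computed by DBSCAN.

```python
import numpy as np
import pandas as pd


def rows_per_cluster(df, unique_points, labels):
    """Count original rows per cluster (hypothetical helper, not from the question).

    df            : original DataFrame with 'latitude' and 'longitude' columns
    unique_points : df[['latitude', 'longitude']].drop_duplicates()
    labels        : cluster labels aligned with unique_points (e.g. dbsc.labels_)
    """
    labelled = unique_points.copy()
    labelled['cluster'] = np.asarray(labels)
    # Attach each distinct point's label to every original row with the same coordinates
    merged = df.merge(labelled, on=['latitude', 'longitude'], how='left')
    return merged['cluster'].value_counts().sort_index()


# Small stand-in for the CSV: the second and third rows share the same coordinates
df = pd.DataFrame({
    'latitude': [23.140366, 22.994716, 22.994716, 22.994916],
    'longitude': [114.18685, 114.2998, 114.2998, 114.299683],
})
unique_points = df[['latitude', 'longitude']].drop_duplicates()
labels = [0, 1, 1]  # assumed labels for the 3 distinct points, not DBSCAN output

row_counts = rows_per_cluster(df, unique_points, labels)
print(row_counts)
```

Here cluster 1 holds two distinct points but three CSV rows, which is the difference between the two ways of counting.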