Python DBSCAN: removing noise from a plot

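Using DBSCAN

DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine')

I have clustered a list of latitude and longitude pairs and then plotted it with matplotlib. The plot includes the "noise" coordinates, that is, the points that were not assigned to any of the 270 clusters that were created. I would like to remove the noise from the plot and draw only the clusters that meet the specified requirements, but I am not sure how to do that. How can I exclude the noise (again, the points that are not assigned to a cluster)?

Here is the code I used for clustering and plotting: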

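import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
from sklearn import metrics
from sklearn.cluster import DBSCAN

df = pd.read_csv('xxx.csv')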

# define the number of kilometers in one radian
# which will be used to convert eps from km to radians
kms_per_rad = 6371.0088

# define a function to calculate the geographic coordinate
# centroid of a cluster of geographic points
# it will be used later to calculate the centroids of DBSCAN cluster
# because Scikit-learn DBSCAN cluster class does not come with centroid attribute.
def get_centroid(cluster):
    """Calculate the centroid of a cluster of geographic coordinate points.

    Args:
      cluster: coordinates, n x 2 array-like (array, list of lists, etc.),
        where n is the number of (latitude, longitude) points in the cluster.
    Returns:
      geometric centroid of the cluster
    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid

# convert eps to radians for use by haversine
epsilon = 0.1/kms_per_rad #1.5=1.5km  1=1km  0.5=500m 0.25=250m   0.1=100m

# Extract the coordinates (latitude, longitude)
# (DataFrame.as_matrix was removed from pandas; to_numpy() replaces it)
tweet_coords = df[['latitude', 'longitude']].to_numpy()

start_time = time.time()
dbsc = (DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine')
    .fit(np.radians(tweet_coords)))

tweet_cluster_labels = dbsc.labels_

# get the number of clusters (the noise label -1 is not a cluster)
num_clusters = len(set(dbsc.labels_)) - (1 if -1 in dbsc.labels_ else 0)

# print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
print('Silhouette coefficient:     {:0.03f}'.format(metrics.silhouette_score(tweet_coords, tweet_cluster_labels)))

# Turn the clusters into a pandas series,where each element is a cluster of points
dbsc_clusters = pd.Series([tweet_coords[tweet_cluster_labels==n] for n in  range(num_clusters)])

# get centroid of each cluster
cluster_centroids = dbsc_clusters.map(get_centroid)
# unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
cent_lats, cent_lons = zip(*cluster_centroids)
# from these lats/lons create a new df of one representative point for each cluster
centroids_df = pd.DataFrame({'longitude':cent_lons, 'latitude':cent_lats})
#print centroids_df

# Plot the clusters and cluster centroids
fig, ax = plt.subplots(figsize=[20, 12])
tweet_scatter = ax.scatter(df['longitude'], df['latitude'],   c=tweet_cluster_labels, cmap = cm.hot, edgecolor='None', alpha=0.25, s=50)
centroid_scatter = ax.scatter(centroids_df['longitude'], centroids_df['latitude'], marker='x', linewidths=2, c='k', s=50)
ax.set_title('Tweet Clusters & Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([tweet_scatter, centroid_scatter], ['Tweets', 'Tweets Cluster Centroids'], loc='upper right', fontsize = 20)
plt.show()


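The black points are noise, meaning points that were not added to any cluster under the DBSCAN inputs, while the colored points are the clusters. My goal is to visualize only the clusters.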

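Store the labels in an additional column of the original DataFrame:

df['tweet_cluster_labels'] = tweet_cluster_labels

Filter the DataFrame so that it contains only the non-noise points (noise samples are labeled -1):

df_filtered = df[df.tweet_cluster_labels > -1]

Then plot those points: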

tweet_scatter = ax.scatter(df_filtered['longitude'], 
                df_filtered['latitude'],
                c=df_filtered.tweet_cluster_labels, 
                cmap=cm.hot, edgecolor='None', alpha=0.25, s=50)
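
The filtering step can be tried on its own with a few rows of made-up data. The following is only a minimal sketch: the coordinates are hypothetical, and the labels are hard-coded in the form that DBSCAN exposes through `labels_` (with -1 marking noise) rather than computed by an actual clustering run.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: five points with made-up coordinates.
df = pd.DataFrame({
    "latitude":  [40.0, 40.0001, 40.0002, 41.5, 42.0],
    "longitude": [-75.0, -75.0001, -75.0002, -70.0, -80.0],
})

# Labels in the form DBSCAN exposes via `labels_`: -1 marks noise.
# These values are hard-coded for illustration, not computed here.
labels = np.array([0, 0, 0, -1, -1])

# Keep the labels next to the coordinates they belong to.
df["tweet_cluster_labels"] = labels

# The boolean mask keeps only rows that belong to some cluster.
df_filtered = df[df["tweet_cluster_labels"] != -1]

print(len(df), "points in total,", len(df_filtered), "after dropping noise")
```

The columns of `df_filtered` can then be passed to `ax.scatter` in place of the columns of `df`, so the color array and the coordinates stay the same length.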

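I edited the post so that it contains only one question. Noise is the set of points that the algorithm did not assign to a cluster. I hope this latest edit further clarifies what noise is and what I am trying to achieve. I tried to include screenshots of the results, but my reputation is not high enough.

Pictures are useful. You can add them and they will show up as links. If you notify me, I can embed them.

I can include links to imgur.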
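
Because noise carries the label -1, the same care applies when counting clusters and computing centroids: the noise label should not be counted as a cluster. A small sketch of one way to do that with NumPy follows. The coordinates and labels are hypothetical, hard-coded values, and the centroid is the plain arithmetic mean of latitude and longitude, as in the question's `get_centroid`.

```python
import numpy as np

# Hypothetical (latitude, longitude) pairs and hard-coded labels,
# in the form DBSCAN exposes via `labels_` (-1 marks noise).
coords = np.array([
    [40.0, -75.0],
    [40.2, -75.2],
    [50.0, -60.0],
    [50.4, -60.4],
    [10.0, 10.0],   # treated as noise below
])
labels = np.array([0, 0, 1, 1, -1])

# Unique cluster ids, with the noise label -1 left out.
cluster_ids = np.unique(labels[labels != -1])
num_clusters = len(cluster_ids)

# Arithmetic mean position of each real cluster.
centroids = np.array([coords[labels == k].mean(axis=0) for k in cluster_ids])

print(num_clusters)
print(centroids)
```

Iterating over the ids that actually occur avoids building an empty cluster, which is what happens when `range(len(set(labels)))` is used while a noise label is present.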