Python: how to assign unique identifiers to DataFrame rows
I have a .csv file that was created from an nd.array after the input data was processed by sklearn.cluster.DBSCAN(). I would like to be able to "tie" each point in a cluster to a unique identifier given in one column of the input file.

This is how I read the input data:
# Generate sample data
col_1 = "RL15_LONGITUDE"
col_2 = "RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
coords = data.as_matrix(columns=[col_1, col_2])  # as_matrix() is deprecated; modern pandas uses .to_numpy()
data = data[[col_1, col_2]].dropna()
data = data.as_matrix().astype('float16', copy=False)
This is what the data looks like:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE
2004_Charley95104-257448 2004_Charley 25.81774 -80.25079
2004_Charley93724-254950 2004_Charley 26.116338 -81.74986
2004_Charley93724-254949 2004_Charley 26.116338 -81.74986
2004_Charley75496-215198 2004_Charley 26.11817 -81.75756
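As a side note on the snippet above: casting the coordinates to float16 keeps only about three significant decimal digits, which matters later when trying to match points back to their rows. A minimal sketch with two made-up rows (column names from the question, values illustrative):

```python
import pandas as pd

# Synthetic two-row sample standing in for 2004_Charley_data.csv
data = pd.DataFrame({
    "RL15_LONGITUDE": [-80.25079, -81.74986],
    "RL15_LATITUDE": [25.81774, 26.116338],
})

# Modern replacement for the deprecated .as_matrix()
coords = data[["RL15_LONGITUDE", "RL15_LATITUDE"]].dropna().to_numpy()

# float16 has only ~3 significant decimal digits, so coordinates drift
lossy = coords.astype('float16')
print(coords[0, 0])  # -80.25079
print(lossy[0, 0])   # -80.25 (precision lost)
```

Keeping the coordinates in float64 (the pandas default) avoids this drift entirely.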
With some help, I was able to save the output of DBSCAN to a .csv file like this:
clusters = (pd.concat([pd.DataFrame(c, columns=[col_2,col_1]).assign(cluster=i)
for i,c in enumerate(clusters)])
.reset_index()
.rename(columns={'index':'point'})
.set_index(['cluster','point'])
)
clusters.to_csv('output.csv')
My output is now multi-indexed, but I would like to know whether there is a way to change the column point to the RecordID instead of just a number:
cluster point RL15_LATITUDE RL15_LONGITUDE
0 0 -81.0625 29.234375
0 1 -81.0625 29.171875
0 2 -81.0625 29.359375
1 0 -81.0625 29.25
1 1 -81.0625 29.21875
1 2 -81.0625 29.25
1 3 -81.0625 29.21875
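Since DBSCAN returns one label per input row, in row order, one way to get a RecordID-keyed result is to attach the labels to the original frame and index by (cluster, RecordID) instead of (cluster, point). A minimal sketch with toy data (the label values are made up):

```python
import pandas as pd

# Toy stand-ins: original rows with their IDs, and DBSCAN-style labels
df = pd.DataFrame({
    "RecordID": ["2004_Charley95104-257448", "2004_Charley93724-254950",
                 "2004_Charley93724-254949"],
    "RL15_LATITUDE": [25.81774, 26.116338, 26.116338],
    "RL15_LONGITUDE": [-80.25079, -81.74986, -81.74986],
})
labels = [0, 1, 1]  # one cluster label per input row, in row order

# Attach the labels directly and index by (cluster, RecordID)
clusters = (df.assign(cluster=labels)
              .set_index(["cluster", "RecordID"])
              .sort_index())
print(clusters)
```

This sidesteps the loss of the identifier entirely, because the rows are never detached from their IDs.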
UPDATE:

Code:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)
cols = ['RL15_LONGITUDE', 'RL15_LATITUDE']
eps_ = 4           # search radius in km
min_samples_ = 13
# eps is converted to radians by dividing by the Earth radius (6371 km)
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['cluster'] = labels
res = df[df.cluster >= 0]   # drop noise points (label -1)
print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())
Output:
--------------------------------
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
5 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
13 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
18 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
19 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
22 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
.. ... ... ... ... ...
812 2004_Charley94314-256039 2004_Charley 28.287827 -81.353285 1
813 2004_Charley93913-255344 2004_Charley 26.532980 -82.194400 7
814 2004_Charley93913-255346 2004_Charley 27.210467 -81.863720 5
815 2004_Charley93913-255357 2004_Charley 26.935550 -82.054447 4
816 2004_Charley93913-255354 2004_Charley 26.935550 -82.054447 4
[688 rows x 5 columns]
--------------------------------
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
OLD answer:
If I understood your code correctly, you can do it this way:
# read CSV (you have provided a space-delimited file with one unnamed
# column, so I have converted it to something similar to the sample
# from your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn, sep=r'\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
The first 10 rows:
In [148]: df.head(10)
Out[148]:
Storm RL15_LATITUDE RL15_LONGITUDE
RecordID
2004_Charley67146-196725 2004_Charley 33.807550 -78.701172
2004_Charley73944-211790 2004_Charley 33.618435 -78.993407
2004_Charley73944-211793 2004_Charley 28.609200 -80.818880
2004_Charley73944-211789 2004_Charley 29.383210 -81.160100
2004_Charley73944-211786 2004_Charley 33.691235 -78.895129
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440
2004_Charley73944-211795 2004_Charley 28.357253 -80.701632
2004_Charley73944-211792 2004_Charley 34.204490 -77.924700
2004_Charley66636-195501 2004_Charley 33.436717 -79.132074
2004_Charley66631-195496 2004_Charley 33.646292 -78.977968
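The index_col=0 plus string-concatenation trick above stitches a RecordID that the whitespace-delimited parse split into two tokens back together. A small self-contained illustration with a made-up two-row sample (the column layout is an assumption about the file, not taken from it):

```python
import io
import pandas as pd

# Made-up sample: the ID prefix landed in the first column and the
# numeric suffix was parsed into RecordID as an integer
raw = io.StringIO(
    "id_prefix RecordID Storm RL15_LATITUDE RL15_LONGITUDE\n"
    "2004_Charley95104 -257448 2004_Charley 25.81774 -80.25079\n"
    "2004_Charley93724 -254950 2004_Charley 26.116338 -81.74986\n"
)
df = pd.read_csv(raw, sep=r'\s+', index_col=0)

# Glue the two halves back into one full RecordID
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
print(df.index.tolist())
```

The .map(str) is what lets the integer suffix (including its minus sign) be concatenated onto the string prefix.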
Clustering:
cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Set the cluster information on our DF - we can simply assign it, as labels has the same length as our DF:
df['cluster'] = labels
Filter: keep only the rows with cluster >= 0:
res = df[df.cluster >= 0]
Result:
In [152]: res.head(10)
Out[152]:
Storm RL15_LATITUDE RL15_LONGITUDE cluster
RecordID
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
Stats:
In [151]: res.cluster.value_counts()
Out[151]:
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
If you don't want RecordID as the index:
In [153]: res = res.reset_index()
In [154]: res.head(10)
Out[154]:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
0 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
1 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
3 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
4 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
5 2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
6 2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
7 2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
8 2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
9 2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
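To tie this back to the original goal of writing an output.csv keyed by RecordID, the filtered frame can be written out directly once RecordID is a plain column. A minimal sketch (toy data; in practice you would pass a real file path to to_csv):

```python
import io
import pandas as pd

# Toy stand-in for the filtered result `res` from above
res = pd.DataFrame({
    "RecordID": ["2004_Charley73944-211787", "2004_Charley72308-208134"],
    "RL15_LATITUDE": [29.22856, 29.442692],
    "RL15_LONGITUDE": [-81.03444, -81.109528],
    "cluster": [0, 0],
})

buf = io.StringIO()
res.to_csv(buf, index=False)  # in practice: res.to_csv('output.csv', index=False)
print(buf.getvalue().splitlines()[0])  # header row keeps RecordID as a column
```

With index=False the integer row numbers are dropped and RecordID travels with every row.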
Comments:

- That will be difficult, because the point information appears to be lost after the clustering step... If all the points had unique coordinates, we could try to merge it back in...
- I think I could do that as well, but data.as_matrix() adds a bunch of extra decimals to each coordinate. For example, Lon=-81.74986 becomes Lon=-81.7498626709. Otherwise I could map it in Excel for the analysis. If there were a way to keep the numbers identical, I might not need the RecordID.
- We could compare the numbers - do you mean comparing the final lat/lon against the input lat/lon to see whether they are equal?
- Yes, but I would need a reproducible sample data set...
- Sorry, I was out of town. I'm having trouble loading the file the way you did. Did you specify the number of columns anywhere? I get the error CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6. I think it is a problem with the sep - would the number of columns in my file make a difference? I looked the error up and read about what might be causing it.
- @rubito, no, I have no problems at all with the file you posted on pastebin.com. I just re-checked, using fn = "http://pastebin.com/raw/2f16tDNv" with the code from my answer - it works fine.
- Yes, I think it was my mistake, I'm looking into it. @rubito, I like your approach! Good luck with your presentation! :)
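Regarding the precision drift discussed in the comments (-81.74986 turning into -81.7498626709): it comes from the lossy float cast, so if merging back on coordinates is attempted anyway, rounding both sides to a common precision first sidesteps the mismatch. A hedged sketch with toy data (none of it from the actual file):

```python
import pandas as pd

# Original rows with their identifiers
original = pd.DataFrame({
    "RecordID": ["A-1", "B-2"],
    "lon": [-81.74986, -80.25079],
    "lat": [26.116338, 25.81774],
})
# Cluster output whose coordinates drifted through a lossy float cast
clustered = pd.DataFrame({
    "lon": [-81.7498626709, -80.2507862549],
    "lat": [26.1163387299, 25.8177394867],
    "cluster": [0, 1],
})

# Round both sides to a common precision before merging
key_cols = ["lon", "lat"]
merged = pd.merge(
    original.assign(**{c: original[c].round(4) for c in key_cols}),
    clustered.assign(**{c: clustered[c].round(4) for c in key_cols}),
    on=key_cols, how="inner",
)
print(merged[["RecordID", "cluster"]])
```

This only recovers the mapping reliably when the rounded coordinates are unique per point, which is why assigning the labels to the original frame (as in the answer above) is the more robust route.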