Python: how to assign unique identifiers to DataFrame rows


I have a .csv file that was created from an nd.array after the input data was processed by sklearn.cluster.DBSCAN(). I want to be able to "tie" each point in a cluster to a unique identifier given in a column of the input file.

This is how I read the input data:

# Generate sample data
col_1 = "RL15_LONGITUDE"
col_2 = "RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
coords = data.as_matrix(columns=[col_1, col_2])  # as_matrix() is deprecated in modern pandas
data = data[[col_1, col_2]].dropna()
data = data.as_matrix().astype('float16', copy=False)
This is what it looks like:

RecordID                  Storm         RL15_LATITUDE   RL15_LONGITUDE
2004_Charley95104-257448  2004_Charley  25.81774        -80.25079
2004_Charley93724-254950  2004_Charley  26.116338       -81.74986
2004_Charley93724-254949  2004_Charley  26.116338       -81.74986
2004_Charley75496-215198  2004_Charley  26.11817        -81.75756
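As a side note, as_matrix() was removed in recent pandas versions. A minimal modern equivalent of the read step (using a hypothetical in-memory stand-in for 2004_Charley_data.csv with the same columns) could look like this:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for 2004_Charley_data.csv with the same columns
data = pd.DataFrame({
    "RecordID": ["2004_Charley95104-257448", "2004_Charley93724-254950"],
    "Storm": ["2004_Charley", "2004_Charley"],
    "RL15_LATITUDE": [25.81774, 26.116338],
    "RL15_LONGITUDE": [-80.25079, -81.74986],
})

cols = ["RL15_LONGITUDE", "RL15_LATITUDE"]
# drop missing rows first, then convert; float64 keeps full precision,
# unlike the float16 cast in the question
coords = data[cols].dropna().to_numpy(dtype="float64")
print(coords.shape)  # (2, 2)
```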
With some help, I was able to save the output of DBSCAN to a .csv file like this:

clusters = (pd.concat([pd.DataFrame(c, columns=[col_2, col_1]).assign(cluster=i)
                       for i, c in enumerate(clusters)])
              .reset_index()
              .rename(columns={'index': 'point'})
              .set_index(['cluster', 'point'])
           )
clusters.to_csv('output.csv')
My output is now multi-indexed, but I would like to know whether there is a way to change the point column to the RecordID instead of just a number:

cluster point   RL15_LATITUDE   RL15_LONGITUDE
0       0   -81.0625    29.234375
0       1   -81.0625    29.171875
0       2   -81.0625    29.359375
1       0   -81.0625    29.25
1       1   -81.0625    29.21875
1       2   -81.0625    29.25
1       3   -81.0625    29.21875
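Since the labels DBSCAN returns line up one-to-one with the rows they were fit on, one way to get RecordID into the index instead of a bare counter is to attach the labels to the original frame and index on ['cluster', 'RecordID']. A minimal sketch with made-up data:

```python
import pandas as pd

# made-up stand-ins for the input frame and the DBSCAN labels
df = pd.DataFrame({
    "RecordID": ["r1", "r2", "r3", "r4"],
    "RL15_LATITUDE": [29.23, 29.17, 28.31, 28.31],
    "RL15_LONGITUDE": [-81.06, -81.06, -81.42, -81.42],
})
labels = [0, 0, 1, 1]  # one label per row, in row order

clusters = (df.assign(cluster=labels)
              .set_index(["cluster", "RecordID"])
              .sort_index())
print(clusters)
```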
UPDATE: Code:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)

cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_ = 4          # km
min_samples_ = 13

# eps is divided by Earth's radius (6371 km) to convert km to radians
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

df['cluster'] = labels

res = df[df.cluster >= 0]

print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())
Output:

--------------------------------
                     RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
5    2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
13   2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
18   2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
19   2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
22   2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
..                        ...           ...            ...             ...      ...
812  2004_Charley94314-256039  2004_Charley      28.287827      -81.353285        1
813  2004_Charley93913-255344  2004_Charley      26.532980      -82.194400        7
814  2004_Charley93913-255346  2004_Charley      27.210467      -81.863720        5
815  2004_Charley93913-255357  2004_Charley      26.935550      -82.054447        4
816  2004_Charley93913-255354  2004_Charley      26.935550      -82.054447        4

[688 rows x 5 columns]
--------------------------------
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64
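For context, eps=eps_/6371. converts the 4 km search radius into the angular distance (radians) that the haversine metric works in, 6371 being Earth's mean radius in km. A quick sanity check with two made-up nearby coordinates:

```python
import numpy as np

earth_radius_km = 6371.0
eps_km = 4
eps_rad = eps_km / earth_radius_km  # the value passed to DBSCAN

# great-circle (haversine) distance between two nearby points, in km
lat1, lon1 = np.radians([29.2286, -81.0344])
lat2, lon2 = np.radians([29.2500, -81.0625])
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
d_km = 2 * earth_radius_km * np.arcsin(np.sqrt(a))
print(d_km < eps_km)  # True: these two points would be eps-neighbours
```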
Old answer:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)

cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_ = 4          # km
min_samples_ = 13

# eps is divided by Earth's radius (6371 km) to convert km to radians
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

df['cluster'] = labels

res = df[df.cluster >= 0]

print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())
If I understood your code correctly, you can do it like this:

# read CSV (you provided a space-delimited file with one unnamed column,
# so I converted it to something similar to the file from your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'

df = pd.read_csv(fn, sep=r'\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
First 10 rows:

In [148]: df.head(10)
Out[148]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE
RecordID
2004_Charley67146-196725  2004_Charley      33.807550      -78.701172
2004_Charley73944-211790  2004_Charley      33.618435      -78.993407
2004_Charley73944-211793  2004_Charley      28.609200      -80.818880
2004_Charley73944-211789  2004_Charley      29.383210      -81.160100
2004_Charley73944-211786  2004_Charley      33.691235      -78.895129
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440
2004_Charley73944-211795  2004_Charley      28.357253      -80.701632
2004_Charley73944-211792  2004_Charley      34.204490      -77.924700
2004_Charley66636-195501  2004_Charley      33.436717      -79.132074
2004_Charley66631-195496  2004_Charley      33.646292      -78.977968
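What the index fix above does: in the space-delimited file, the RecordID was split across the unnamed first column (the storm prefix) and the RecordID column (the numeric suffix), so the two pieces get concatenated back together. A tiny illustration with a hypothetical one-row frame:

```python
import pandas as pd

# hypothetical one-row frame mimicking the split file: the unnamed first
# column holds the storm prefix, RecordID holds only the numeric suffix
df = pd.DataFrame({"RecordID": ["67146-196725"],
                   "Storm": ["2004_Charley"],
                   "RL15_LATITUDE": [33.80755]},
                  index=["2004_Charley"])

df.index = df.index.values + df.RecordID.map(str)  # glue the pieces back
del df['RecordID']
print(df.index[0])  # 2004_Charley67146-196725
```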
Clustering:

cols = ['RL15_LONGITUDE','RL15_LATITUDE']

eps_ = 4          # km
min_samples_ = 13

# eps is divided by Earth's radius (6371 km) to convert km to radians
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Setting the cluster info on our DF - we can simply assign it, because labels has the same length as our DF:

df['cluster'] = labels
Filter: keep only the rows with cluster >= 0:

res = df[df.cluster >= 0]
Result:

In [152]: res.head(10)
Out[152]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
RecordID
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1
Statistics:

In [151]: res.cluster.value_counts()
Out[151]:
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64
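As an aside, these counts only cover cluster >= 0 because of the earlier filter; run on the unfiltered frame, value_counts() would also show DBSCAN's noise label -1 as its own row. A toy illustration:

```python
import pandas as pd

# toy labels: -1 is the noise marker DBSCAN assigns to unclustered points
labels = pd.Series([0, 0, 1, -1, -1, -1], name="cluster")
vc = labels.value_counts()
print(vc)  # -1 gets its own count alongside the real clusters
```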
If you don't want RecordID as the index:

In [153]: res = res.reset_index()

In [154]: res.head(10)
Out[154]:
                   RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
0  2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
1  2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2  2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
3  2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
4  2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
5  2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
6  2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
7  2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
8  2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
9  2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

This will be difficult, because the point information seems to get lost during the clustering step... If all points had unique coordinates, we could try to merge it back.

I think I could do that too, but data.as_matrix() adds a bunch of decimals to every coordinate. For example, Lon=-81.74986 becomes Lon=-81.7498626709. Otherwise I can map it in Excel for the analysis. If there were a way to keep the numbers identical, I probably would not need the RecordID.

We could compare the numbers - do you mean comparing the final lat/lon against the input lat/lon to see whether they are equal?

Yes, but I would need a reproducible sample data set... Sorry, I was out of town. I am having trouble loading the file the way you did. Did you specify the number of columns anywhere? I get the error CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6. I think it is a problem with sep=' '; does it matter how many columns my file has? I looked the error up and read that this could be where it is happening.

@rubito, no, I have no problem with the file you posted on pastebin.com. I just re-checked with fn='http://pastebin.com/raw/2f16tDNv' using the code from my answer, and it works fine.

Yes, I think the mistake is on my side, I am looking into it.

@rubito, I like your approach! Good luck with your presentation! :)
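Regarding the extra decimals discussed above: they come from the floating-point casts, not from as_matrix() itself. -81.7498626709 is exactly the float32 representation of -81.74986, and the astype('float16') in the question loses even more precision:

```python
import numpy as np

lon = -81.74986
print(float(np.float32(lon)))  # -81.74986267089844, the "extra decimals" seen
print(float(np.float16(lon)))  # -81.75, float16 keeps only ~3 significant decimals
```

This is why matching rows by comparing output coordinates to input coordinates is fragile, and carrying the RecordID through (as in the answer above) is the safer route.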