Python: how to assign unique identifiers to DataFrame rows
I have a .csv file that was created from an nd.array after the input data was processed by sklearn.cluster.DBSCAN(). I would like to be able to "tie" each point in a cluster to a unique identifier given in one column of the input file.

This is how I read the input data:
# Generate sample data
col_1 = "RL15_LONGITUDE"
col_2 = "RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
coords = data.as_matrix(columns=[col_1, col_2])  # as_matrix() is deprecated; modern pandas uses .to_numpy()
data = data[[col_1, col_2]].dropna()
data = data.as_matrix().astype('float16', copy=False)
This is what the data looks like:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE
2004_Charley95104-257448 2004_Charley 25.81774 -80.25079
2004_Charley93724-254950 2004_Charley 26.116338 -81.74986
2004_Charley93724-254949 2004_Charley 26.116338 -81.74986
2004_Charley75496-215198 2004_Charley 26.11817 -81.75756
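As a side note on the snippet above: casting the coordinates to float16 keeps only about three significant decimal digits, which matters later when trying to match points back to their rows. A minimal sketch with two made-up rows (column names from the question, values illustrative):

```python
import pandas as pd

# Synthetic two-row sample standing in for 2004_Charley_data.csv
data = pd.DataFrame({
    "RL15_LONGITUDE": [-80.25079, -81.74986],
    "RL15_LATITUDE": [25.81774, 26.116338],
})

# Modern replacement for the deprecated .as_matrix()
coords = data[["RL15_LONGITUDE", "RL15_LATITUDE"]].dropna().to_numpy()

# float16 has only ~3 significant decimal digits, so coordinates drift
lossy = coords.astype('float16')
print(coords[0, 0])  # -80.25079
print(lossy[0, 0])   # -80.25 (precision lost)
```

Keeping the coordinates in float64 (the pandas default) avoids this drift entirely.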
With some help, I was able to save the output of DBSCAN to a .csv file like this:
clusters = (pd.concat([pd.DataFrame(c, columns=[col_2,col_1]).assign(cluster=i)
for i,c in enumerate(clusters)])
.reset_index()
.rename(columns={'index':'point'})
.set_index(['cluster','point'])
)
clusters.to_csv('output.csv')
My output is now multi-indexed, but I would like to know whether there is a way to change the column point to the RecordID instead of just a number:
cluster point RL15_LATITUDE RL15_LONGITUDE
0 0 -81.0625 29.234375
0 1 -81.0625 29.171875
0 2 -81.0625 29.359375
1 0 -81.0625 29.25
1 1 -81.0625 29.21875
1 2 -81.0625 29.25
1 3 -81.0625 29.21875
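Since DBSCAN returns one label per input row, in row order, one way to get a RecordID-keyed result is to attach the labels to the original frame and index by (cluster, RecordID) instead of (cluster, point). A minimal sketch with toy data (the label values are made up):

```python
import pandas as pd

# Toy stand-ins: original rows with their IDs, and DBSCAN-style labels
df = pd.DataFrame({
    "RecordID": ["2004_Charley95104-257448", "2004_Charley93724-254950",
                 "2004_Charley93724-254949"],
    "RL15_LATITUDE": [25.81774, 26.116338, 26.116338],
    "RL15_LONGITUDE": [-80.25079, -81.74986, -81.74986],
})
labels = [0, 1, 1]  # one cluster label per input row, in row order

# Attach the labels directly and index by (cluster, RecordID)
clusters = (df.assign(cluster=labels)
              .set_index(["cluster", "RecordID"])
              .sort_index())
print(clusters)
```

This sidesteps the loss of the identifier entirely, because the rows are never detached from their IDs.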
UPDATE:

Code:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)
cols = ['RL15_LONGITUDE', 'RL15_LATITUDE']
eps_ = 4           # search radius in km
min_samples_ = 13
# eps is converted to radians by dividing by the Earth radius (6371 km)
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['cluster'] = labels
res = df[df.cluster >= 0]   # drop noise points (label -1)
print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())
Output:
--------------------------------
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
5 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
13 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
18 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
19 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
22 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
.. ... ... ... ... ...
812 2004_Charley94314-256039 2004_Charley 28.287827 -81.353285 1
813 2004_Charley93913-255344 2004_Charley 26.532980 -82.194400 7
814 2004_Charley93913-255346 2004_Charley 27.210467 -81.863720 5
815 2004_Charley93913-255357 2004_Charley 26.935550 -82.054447 4
816 2004_Charley93913-255354 2004_Charley 26.935550 -82.054447 4
[688 rows x 5 columns]
--------------------------------
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
OLD answer:
If I understood your code correctly, you can do it this way:
# read CSV (you have provided a space-delimited file with one unnamed
# column, so I have converted it to something similar to the sample
# from your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn, sep=r'\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
The first 10 rows:
In [148]: df.head(10)
Out[148]:
Storm RL15_LATITUDE RL15_LONGITUDE
RecordID
2004_Charley67146-196725 2004_Charley 33.807550 -78.701172
2004_Charley73944-211790 2004_Charley 33.618435 -78.993407
2004_Charley73944-211793 2004_Charley 28.609200 -80.818880
2004_Charley73944-211789 2004_Charley 29.383210 -81.160100
2004_Charley73944-211786 2004_Charley 33.691235 -78.895129
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440
2004_Charley73944-211795 2004_Charley 28.357253 -80.701632
2004_Charley73944-211792 2004_Charley 34.204490 -77.924700
2004_Charley66636-195501 2004_Charley 33.436717 -79.132074
2004_Charley66631-195496 2004_Charley 33.646292 -78.977968
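The index_col=0 plus string-concatenation trick above stitches a RecordID that the whitespace-delimited parse split into two tokens back together. A small self-contained illustration with a made-up two-row sample (the column layout is an assumption about the file, not taken from it):

```python
import io
import pandas as pd

# Made-up sample: the ID prefix landed in the first column and the
# numeric suffix was parsed into RecordID as an integer
raw = io.StringIO(
    "id_prefix RecordID Storm RL15_LATITUDE RL15_LONGITUDE\n"
    "2004_Charley95104 -257448 2004_Charley 25.81774 -80.25079\n"
    "2004_Charley93724 -254950 2004_Charley 26.116338 -81.74986\n"
)
df = pd.read_csv(raw, sep=r'\s+', index_col=0)

# Glue the two halves back into one full RecordID
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
print(df.index.tolist())
```

The .map(str) is what lets the integer suffix (including its minus sign) be concatenated onto the string prefix.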
Clustering:
cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Set the cluster information on our DF - we can simply assign it, as labels has the same length as our DF:
df['cluster'] = labels
Filter: keep only the rows with cluster >= 0:
res = df[df.cluster >= 0]
Result:
In [152]: res.head(10)
Out[152]:
Storm RL15_LATITUDE RL15_LONGITUDE cluster
RecordID
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
Stats:
In [151]: res.cluster.value_counts()
Out[151]:
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
If you don't want RecordID as the index:
In [153]: res = res.reset_index()
In [154]: res.head(10)
Out[154]:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
0 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
1 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
3 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
4 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
5 2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
6 2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
7 2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
8 2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
9 2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
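To tie this back to the original goal of writing an output.csv keyed by RecordID, the filtered frame can be written out directly once RecordID is a plain column. A minimal sketch (toy data; in practice you would pass a real file path to to_csv):

```python
import io
import pandas as pd

# Toy stand-in for the filtered result `res` from above
res = pd.DataFrame({
    "RecordID": ["2004_Charley73944-211787", "2004_Charley72308-208134"],
    "RL15_LATITUDE": [29.22856, 29.442692],
    "RL15_LONGITUDE": [-81.03444, -81.109528],
    "cluster": [0, 0],
})

buf = io.StringIO()
res.to_csv(buf, index=False)  # in practice: res.to_csv('output.csv', index=False)
print(buf.getvalue().splitlines()[0])  # header row keeps RecordID as a column
```

With index=False the integer row numbers are dropped and RecordID travels with every row.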
Comments:

- That will be difficult, because the point information appears to be lost after the clustering step... If all the points had unique coordinates, we could try to merge it back in...
- I think I could do that as well, but data.as_matrix() adds a bunch of extra decimals to each coordinate. For example, Lon=-81.74986 becomes Lon=-81.7498626709. Otherwise I could map it in Excel for the analysis. If there were a way to keep the numbers identical, I might not need the RecordID.
- We could compare the numbers - do you mean comparing the final lat/lon against the input lat/lon to see whether they are equal?
- Yes, but I would need a reproducible sample data set...
- Sorry, I was out of town. I'm having trouble loading the file the way you did. Did you specify the number of columns anywhere? I get the error CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6. I think it is a problem with the sep - would the number of columns in my file make a difference? I looked the error up and read about what might be causing it.
- @rubito, no, I have no problems at all with the file you posted on pastebin.com. I just re-checked, using fn = "http://pastebin.com/raw/2f16tDNv" with the code from my answer - it works fine.
- Yes, I think it was my mistake, I'm looking into it. @rubito, I like your approach! Good luck with your presentation! :)
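Regarding the precision drift discussed in the comments (-81.74986 turning into -81.7498626709): it comes from the lossy float cast, so if merging back on coordinates is attempted anyway, rounding both sides to a common precision first sidesteps the mismatch. A hedged sketch with toy data (none of it from the actual file):

```python
import pandas as pd

# Original rows with their identifiers
original = pd.DataFrame({
    "RecordID": ["A-1", "B-2"],
    "lon": [-81.74986, -80.25079],
    "lat": [26.116338, 25.81774],
})
# Cluster output whose coordinates drifted through a lossy float cast
clustered = pd.DataFrame({
    "lon": [-81.7498626709, -80.2507862549],
    "lat": [26.1163387299, 25.8177394867],
    "cluster": [0, 1],
})

# Round both sides to a common precision before merging
key_cols = ["lon", "lat"]
merged = pd.merge(
    original.assign(**{c: original[c].round(4) for c in key_cols}),
    clustered.assign(**{c: clustered[c].round(4) for c in key_cols}),
    on=key_cols, how="inner",
)
print(merged[["RecordID", "cluster"]])
```

This only recovers the mapping reliably when the rounded coordinates are unique per point, which is why assigning the labels to the original frame (as in the answer above) is the more robust route.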