Python: How do I increment a counter based on a condition in a DataFrame? (python, pandas)

I have a set of records, each labeled with a cluster value.

Original DataFrame, df:

+-------------+---------+
| measurement | cluster |
+-------------+---------+
| M1          |       6 |
| M2          |       6 |
| M3          |       6 |
| M4          |      12 |
| M5          |      12 |
| M6          |      12 |
| M7          |       2 |
| M8          |       9 |
| M9          |       9 |
| M10         |       9 |
| M11         |       9 |
+-------------+---------+

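How can I renumber the clusters with a running counter, based on whether the current cluster value equals the previous or the next cluster value, while assigning "x" to rows whose cluster value equals neither the previous nor the next one?

Desired df: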

+-------------+---------+-------------+
| measurement | cluster | new_cluster |
+-------------+---------+-------------+
| M1          |       6 |           1 |
| M2          |       6 |           1 |
| M3          |       6 |           1 |
| M4          |      12 |           2 |
| M5          |      12 |           2 |
| M6          |      12 |           2 |
| M7          |       2 |           x |
| M8          |       9 |           3 |
| M9          |       9 |           3 |
| M10         |       9 |           3 |
| M11         |       9 |           3 |
+-------------+---------+-------------+

Use pd.factorize on the values selected by a boolean mask:

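# True for rows that belong to a run of two or more consecutive equal cluster values
m = df['cluster'].ne(df['cluster'].shift()).cumsum().duplicated(keep=False)
# Number the selected cluster values 1, 2, 3, ... in order of first appearance
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1
print(df)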
   measurement  cluster  new_cluster
0           M1        6          1.0
1           M2        6          1.0
2           M3        6          1.0
3           M4       12          2.0
4           M5       12          2.0
5           M6       12          2.0
6           M7        2          NaN
7           M8        9          3.0
8           M9        9          3.0
9          M10        9          3.0
10         M11        9          3.0
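
The snippet above assumes that df already exists. A minimal end-to-end sketch, assuming the sample data is typed in by hand rather than loaded from a file (run_id is a name introduced here only for readability):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'measurement': [f'M{i}' for i in range(1, 12)],
    'cluster': [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})

# Give each run of consecutive equal cluster values its own id.
run_id = df['cluster'].ne(df['cluster'].shift()).cumsum()

# Keep only rows whose run contains at least two rows.
m = run_id.duplicated(keep=False)

# Number the kept cluster values 1, 2, 3, ... in order of first appearance.
# Rows outside the mask are left as NaN, so the new column is float.
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1

print(df)
```

The M7 row stays NaN because its cluster value 2 forms a run of length one.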

If you want to replace NaN with x:

df['new_cluster'] = df['new_cluster'].fillna('x')
print (df)
   measurement  cluster new_cluster
0           M1        6           1
1           M2        6           1
2           M3        6           1
3           M4       12           2
4           M5       12           2
5           M6       12           2
6           M7        2           x
7           M8        9           3
8           M9        9           3
9          M10        9           3
10         M11        9           3
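
Mixing numbers and the string 'x' turns the column into object dtype, and because the numbers started out as floats they may be stored as 1.0 rather than 1. One possible way to keep whole numbers, sketched under the assumption that the pandas version in use supports the nullable Int64 dtype (the Series below is a made-up stand-in for the new_cluster column):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the float column; NaN marks the unmatched row.
new_cluster = pd.Series([1.0, 1.0, 2.0, np.nan, 3.0])

labels = (
    new_cluster.astype('Int64')        # nullable integers, missing value kept
    .astype(object)                    # object dtype can hold ints and strings
    .where(new_cluster.notna(), 'x')   # write 'x' where the value was missing
)

print(labels.tolist())
```

The result is still an object column, so it is suited to display or export rather than further arithmetic.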

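Details of the boolean mask: first create a helper Series that labels each run of consecutive values, then call duplicated with keep=False on it to flag all duplicates: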

print (df['cluster'].ne(df['cluster'].shift()).cumsum())
0     1
1     1
2     1
3     2
4     2
5     2
6     3
7     4
8     4
9     4
10    4
Name: cluster, dtype: int32

print (m)
0      True
1      True
2      True
3      True
4      True
5      True
6     False
7      True
8      True
9      True
10     True
Name: cluster, dtype: bool
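
The individual steps of the mask can also be written out one at a time. The following is only a sketch on a shorter, made-up Series; the names starts_run, run_id, in_long_run and by_size are introduced here for illustration. It also shows a groupby-based formulation that should give the same mask:

```python
import pandas as pd

# Hypothetical shortened example.
cluster = pd.Series([6, 6, 12, 2, 9, 9])

# True wherever the value differs from the previous row (the first row is always True).
starts_run = cluster.ne(cluster.shift())

# The cumulative sum of those flags gives each run its own id: 1, 1, 2, 3, 4, 4.
run_id = starts_run.cumsum()

# keep=False marks every member of a duplicated run id, not just the repeats.
in_long_run = run_id.duplicated(keep=False)

# Alternative: a run id is duplicated exactly when its run has more than one row.
by_size = cluster.groupby(run_id).transform('size').gt(1)

print(in_long_run.tolist())  # [True, True, False, False, True, True]
```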

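According to your rule, shouldn't M1 be x?

Oh my goodness... this is amazing. So concise, and no loops! Thank you, jezrael!

This doesn't account for non-consecutive cluster values. That is, if cluster 2 shows up again later in the dataset, it will be a duplicate without being equal to its adjacent values.

Side question: how did you load the data? With pd.read_clipboard?

@piRSquared - I agree, so I added a general solution.

I'm happy now (-: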
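
The point about non-consecutive cluster values can be illustrated with a small made-up frame in which the same cluster value occurs in two separate runs. The sketch below assumes that such separated runs should receive different numbers, which the question does not state; the data and the names by_value and by_run are hypothetical.

```python
import pandas as pd

# Hypothetical data: cluster 6 occurs in two separate runs.
df = pd.DataFrame({'cluster': [6, 6, 12, 12, 2, 6, 6]})

run_id = df['cluster'].ne(df['cluster'].shift()).cumsum()
m = run_id.duplicated(keep=False)

# Factorizing the cluster value reuses number 1 for the second run of 6s.
by_value = pd.factorize(df.loc[m, 'cluster'])[0] + 1

# Factorizing the run id gives every qualifying run its own number.
by_run = pd.factorize(run_id[m])[0] + 1

print(by_value.tolist())  # [1, 1, 2, 2, 1, 1]
print(by_run.tolist())    # [1, 1, 2, 2, 3, 3]
```

Which of the two numberings is appropriate depends on whether a recurring cluster value is meant to be the same cluster or a new one.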