Python: How to increment a counter based on a condition in a DataFrame?
I have a bunch of records, each labeled with a cluster value.
Original DataFrame, df:
+-------------+---------+
| measurement | cluster |
+-------------+---------+
| M1 | 6 |
| M2 | 6 |
| M3 | 6 |
| M4 | 12 |
| M5 | 12 |
| M6 | 12 |
| M7 | 2 |
| M8 | 9 |
| M9 | 9 |
| M10 | 9 |
| M11 | 9 |
+-------------+---------+
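For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above by hand. The column names and values come from the table; the construction itself is only one possible way to load the data.

```python
import pandas as pd

# Rebuild the example frame from the table above
df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],  # M1 ... M11
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})
print(df)
```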
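How can I renumber the clusters with new sequential numbers, based on whether the current cluster value equals the previous or the next cluster value, while assigning "x" to rows whose cluster value equals neither the previous nor the next one?
Desired df: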
+-------------+---------+-------------+
| measurement | cluster | new_cluster |
+-------------+---------+-------------+
| M1 | 6 | 1 |
| M2 | 6 | 1 |
| M3 | 6 | 1 |
| M4 | 12 | 2 |
| M5 | 12 | 2 |
| M6 | 12 | 2 |
| M7 | 2 | x |
| M8 | 9 | 3 |
| M9 | 9 | 3 |
| M10 | 9 | 3 |
| M11 | 9 | 3 |
+-------------+---------+-------------+
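Use pd.factorize on the values filtered by a boolean mask:
import pandas as pd
# True for rows that belong to a run of at least two equal consecutive values
m = df['cluster'].ne(df['cluster'].shift()).cumsum().duplicated(keep=False)
# Number the kept cluster values 1, 2, 3, ... in order of first appearance
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1
print(df)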
measurement cluster new_cluster
0 M1 6 1.0
1 M2 6 1.0
2 M3 6 1.0
3 M4 12 2.0
4 M5 12 2.0
5 M6 12 2.0
6 M7 2 NaN
7 M8 9 3.0
8 M9 9 3.0
9 M10 9 3.0
10 M11 9 3.0
If you want to replace NaN with x:
df['new_cluster'] = df['new_cluster'].fillna('x')
print(df)
measurement cluster new_cluster
0 M1 6 1
1 M2 6 1
2 M3 6 1
3 M4 12 2
4 M5 12 2
5 M6 12 2
6 M7 2 x
7 M8 9 3
8 M9 9 3
9 M10 9 3
10 M11 9 3
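Note that new_cluster is a float column before the fill, so depending on the pandas version the numbers may still display as 1.0, 2.0, 3.0 after fillna. Here is a minimal sketch of one way to get clean labels, assuming string labels are acceptable: start from an object column filled with x and write the factorized codes into it as strings. The name labels is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})

# Same mask as in the answer: rows inside a run of two or more equal values
m = df["cluster"].ne(df["cluster"].shift()).cumsum().duplicated(keep=False)

# Start with "x" everywhere, then overwrite the masked rows with string codes
labels = pd.Series("x", index=df.index, dtype=object)
labels[m] = (pd.factorize(df.loc[m, "cluster"])[0] + 1).astype(str)
df["new_cluster"] = labels
print(df)
```

This avoids the NaN intermediate entirely, at the cost of new_cluster holding strings rather than numbers.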
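Details of the boolean mask: first create a helper Series that labels each run of consecutive equal values, then call duplicated with keep=False so that every member of a run of two or more is flagged:
print(df['cluster'].ne(df['cluster'].shift()).cumsum())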
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 4
8 4
9 4
10 4
Name: cluster, dtype: int32
print(m)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
9 True
10 True
Name: cluster, dtype: bool
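The same mask can also be written to follow the wording of the question directly: a row is kept if its cluster equals the previous or the next value. A small sketch, with prev_eq, next_eq and neighbor_mask as illustrative names and the cluster values taken from the example:

```python
import pandas as pd

cluster = pd.Series([6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9], name="cluster")

# Compare each value with its neighbours; the NaN shifted in at the
# edges never compares equal, so the first and last rows are handled
prev_eq = cluster.eq(cluster.shift())
next_eq = cluster.eq(cluster.shift(-1))
neighbor_mask = prev_eq | next_eq

# Mask from the answer, built from run ids
m = cluster.ne(cluster.shift()).cumsum().duplicated(keep=False)

# Check that the two formulations agree on this data
print(neighbor_mask.tolist() == m.tolist())
```

Both formulations flag the same rows here; the cumsum version is more compact, while the shift version reads closer to the rule as stated.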
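According to your rule, shouldn't M1 be x?
Oh my god... this is amazing. So concise, and no loops! Thank you, jezrael!
This doesn't account for non-consecutive cluster values. That is, if cluster 2 appears again later in the dataset, it will be a duplicate but not equal to its neighbors.
Side question: how did you load the data? With pd.read_clipboard?
@piRSquared - I agree, so I added a general solution.
I'm happy now (-: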
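On the point about a cluster value reappearing later: the run-based mask handles isolated repeats, but pd.factorize on the cluster values gives a later run of the same value the same number as its first run. If each run should get its own number instead (an assumption about the desired behavior, not something the question specifies), one option is to factorize the run ids. The data and the names by_value and by_run below are hypothetical.

```python
import pandas as pd

# Hypothetical data: 6 forms two separate runs, 2 appears twice on its own
df = pd.DataFrame({"cluster": [6, 6, 2, 12, 12, 2, 6, 6]})

# Run id increases by one every time the value changes
run_id = df["cluster"].ne(df["cluster"].shift()).cumsum()
m = run_id.duplicated(keep=False)

# Numbering by cluster value: both runs of 6 share the same number
by_value = pd.Series("x", index=df.index, dtype=object)
by_value[m] = pd.factorize(df.loc[m, "cluster"])[0] + 1

# Numbering by run: every run of two or more rows gets a fresh number
by_run = pd.Series("x", index=df.index, dtype=object)
by_run[m] = pd.factorize(run_id[m])[0] + 1

df["by_value"] = by_value
df["by_run"] = by_run
print(df)
```

Which of the two numberings is wanted depends on whether a repeated cluster value is meant to be the same cluster or a new one.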