Python: How to generate a NaN-aware sequence in pandas
Tags: python, pandas, boolean, nan, cumsum
I have a Series containing NaN and True values. I want to generate another Series holding a sequence of numbers, such that every row where NaN occurs gets the value 0, and the rows between two NaN rows get a cumulative count (cumcount), i.e.
Input:
colA
NaN
True
True
True
True
NaN
True
NaN
NaN
True
True
True
True
True
Output:
colA Sequence
NaN 0
True 0
True 1
True 2
True 3
NaN 0
True 0
NaN 0
NaN 0
True 0
True 1
True 2
True 3
True 4
How can I do this in pandas?
You can use groupby + cumcount + mask here:
m = df.colA.isnull()
df['Sequence'] = df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
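To make the one-liner above easier to follow, here is a minimal sketch that prints each intermediate step on the sample data from the question. The names `group_id`, `pos` and `steps` are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

m = df['colA'].isnull()                 # True on the NaN rows
group_id = m.cumsum()                   # every NaN row starts a new group
pos = df.groupby(group_id).cumcount()   # 0 on the NaN row, 1, 2, ... on the Trues after it

steps = pd.DataFrame({
    'is_nan': m,
    'group_id': group_id,
    'cumcount': pos,
    'minus_1': pos.sub(1),              # shift so the first True after a NaN gets 0
    'Sequence': pos.sub(1).mask(m, 0),  # the NaN rows (now -1) are reset to 0
})
print(steps)
```

Note that this relies on every run of Trues being preceded by a NaN row, as in the sample. If the column started with True, the first run would come out off by one.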
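Alternatively, clip the result at 0 in the last step, so that you don't have to cache m beforehand. The original answer used clip_lower(0); that method was removed in pandas 1.0, and clip(lower=0) is the equivalent on current versions:
df['Sequence'] = df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip(lower=0)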
Timings
df = pd.concat([df] * 10000, ignore_index=True)
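The measured numbers did not survive in this copy of the thread, so none are reproduced here. As a rough sketch of how such a comparison could be rerun, the snippet below times the groupby variant against the groupby-free variant on the enlarged frame. The helper names `with_groupby` and `without_groupby`, and the choice of 5 repetitions, are assumptions made for illustration.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                              np.nan, np.nan, True, True, True, True, True]})
big = pd.concat([base] * 10000, ignore_index=True)


def with_groupby(frame):
    # groupby + cumcount + mask
    m = frame['colA'].isnull()
    return frame.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)


def without_groupby(frame):
    # cumsum + ffill, no groupby
    a = frame['colA'].notnull()
    b = a.cumsum()
    return (b - b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)


# Check that both variants agree before comparing their speed.
same = np.array_equal(with_groupby(big).to_numpy(), without_groupby(big).to_numpy())
print('results match:', same)

for fn in (with_groupby, without_groupby):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```

Absolute timings depend on the machine, the pandas version and the shape of the data, so only the relative order is worth reading from a run like this.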
Note that your mileage may vary depending on the data.
If performance matters, it is better to count the consecutive Trues without groupby:
a = df['colA'].notnull()
b = a.cumsum()
df['Sequence'] = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
print (df)
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
Explanation:
df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,
                           True,np.nan,np.nan,True,True,True,True,True]})
a = df['colA'].notnull()
# cumulative sum; each True is counted as 1
b = a.cumsum()
# keep b only on the NaN rows (where a is False); the True rows become NaN
c = b.mask(a)
# add 1 so that the count within each run starts from 0
d = b.mask(a).add(1)
# forward fill the NaNs, replace any leading NaNs with 0 and cast to int
e = b.mask(a).add(1).ffill().fillna(0).astype(int)
# subtract from b to get the counts
f = b-b.mask(a).add(1).ffill().fillna(0).astype(int)
# replace the -1 on the NaN rows with 0, using mask a
g = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
# all together
df = pd.concat([a,b,c,d,e,f,g], axis=1, keys=list('abcdefg'))
print (df)
a b c d e f g
0 False 0 0.0 1.0 1 -1 0
1 True 1 NaN NaN 1 0 0
2 True 2 NaN NaN 1 1 1
3 True 3 NaN NaN 1 2 2
4 True 4 NaN NaN 1 3 3
5 False 4 4.0 5.0 5 -1 0
6 True 5 NaN NaN 5 0 0
7 False 5 5.0 6.0 6 -1 0
8 False 5 5.0 6.0 6 -1 0
9 True 6 NaN NaN 6 0 0
10 True 7 NaN NaN 6 1 1
11 True 8 NaN NaN 6 2 2
12 True 9 NaN NaN 6 3 3
13 True 10 NaN NaN 6 4 4
Try this:
df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
Full example:
>>> df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,True,np.nan,np.nan,True,True,True,True,True]})
>>> df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
>>> df
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
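The reason this works may not be obvious: NaN never compares equal to anything, including another NaN, so every NaN row is flagged as a change and ends up in a run of its own. A small sketch on a shortened, made-up series shows the intermediate values; `changed`, `run_id` and `sequence` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, True, True, np.nan, np.nan, True], name='colA')

# A row is a "change" when it differs from the previous row.
# NaN != NaN is True, so consecutive NaN rows are each flagged.
changed = s != s.shift(1)
run_id = changed.cumsum()                 # label of the run each row belongs to
sequence = s.groupby(run_id).cumcount()   # position inside the run, starting at 0

print(pd.DataFrame({'colA': s, 'changed': changed,
                    'run_id': run_id, 'Sequence': sequence}))
```

One design consequence is that any run of equal values is numbered, not only runs of True. For a column holding just NaN and True, as in the question, that gives the requested output.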
Late to the party, but here is a numpy solution wrapped in a function:
import pandas as pd, numpy as np
df = pd.DataFrame({'ColA': [np.nan, True, True, True, True, np.nan, True,
np.nan, np.nan, True, True, True, True, True]})
def return_cumsum(df):
    v = np.array(df.ColA, dtype=float)
    n = np.isnan(v)
    v[n] = -np.diff(np.concatenate(([0.], np.cumsum(~n)[n])))
    df['Sequence'] = np.array(np.maximum(0, np.cumsum(v)-1), dtype=int)
    return df
df = return_cumsum(df)
# ColA Sequence
# 0 NaN 0
# 1 True 0
# 2 True 1
# 3 True 2
# 4 True 3
# 5 NaN 0
# 6 True 0
# 7 NaN 0
# 8 NaN 0
# 9 True 0
# 10 True 1
# 11 True 2
# 12 True 3
# 13 True 4
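The function above packs several steps into two lines. As a sketch of the idea, the snippet below spells them out on a shorter, made-up array: each NaN is replaced by the negated length of the run of Trues before it, so a running sum drops back to 0 at every NaN. The names `seen`, `run_lengths` and `running` are illustrative only.

```python
import numpy as np

v = np.array([np.nan, 1, 1, 1, np.nan, 1, np.nan, np.nan, 1, 1], dtype=float)
n = np.isnan(v)

# Number of Trues seen so far, sampled at each NaN position: [0, 3, 4, 4]
seen = np.cumsum(~n)[n]
# Trues between consecutive NaNs: [0, 3, 1, 0]
run_lengths = np.diff(np.concatenate(([0.], seen)))
# Writing the negated run length into each NaN slot cancels the preceding run.
v[n] = -run_lengths
running = np.cumsum(v)    # counts 1, 2, 3, ... within a run, 0 at each NaN
sequence = np.maximum(0, running - 1).astype(int)

print(sequence)           # [0 0 1 2 0 0 0 0 0 1]
```

Subtracting 1 makes the count start at 0 for the first True of a run, and `np.maximum` lifts the resulting -1 on the NaN rows back to 0.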
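Comments:
What have you tried so far? groupby and fillna
Thanks for doing the timings. I was going to do them too, but you were faster.
@jezrael: No problem, I like thorough answers.
@jezrael: No problem. By the way, nice answer ;)
@jezrael Also better than mine, I get 800 ms :|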