Python: finding the longest stretch of nulls and generating a flag


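I have a dataframe with a datetime and two columns. For a particular date, I have to find the longest stretch of nulls in column 'X' and replace it with zeros in both columns for that date. In addition, I have to create a third column named 'flag' that holds 1 for every row that was imputed in the other two columns and 0 otherwise. In the example below, the longest stretch of nulls on 1 January is 3 rows, so I have to replace those with zeros. Similarly, I have to repeat the process for 2 January.

Here is my sample data: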

Datetime            X    Y
01-01-2018 00:00    1   1
01-01-2018 00:05    nan 2
01-01-2018 00:10    2   nan
01-01-2018 00:15    3   4
01-01-2018 00:20    2   2
01-01-2018 00:25    nan 1
01-01-2018 00:30    nan nan
01-01-2018 00:35    nan nan
01-01-2018 00:40    4   4
02-01-2018 00:00    nan nan
02-01-2018 00:05    2   3
02-01-2018 00:10    2   2
02-01-2018 00:15    2   5
02-01-2018 00:20    2   2
02-01-2018 00:25    nan nan
02-01-2018 00:30    nan 1
02-01-2018 00:35    3   nan
02-01-2018 00:40    nan nan
[Image: the result I expect]

This question is an extension of my previous question. Here is the link.
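For anyone who wants to reproduce the question, the sample table above can be rebuilt as a DataFrame roughly as follows. This is only a sketch: the variable names are illustrative, and the two day values (2018-01-01 and 2018-02-01) are an assumption chosen to match the printed output in the answer, where `02-01-2018` appears to have been parsed month-first. If the dates are really day-first, the second day would be 2018-01-02; the grouping by day behaves the same either way.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Nine 5-minute stamps per day. The two day values mirror the printed output
# in the answer (an assumption about how the original strings were parsed).
index = pd.date_range("2018-01-01", periods=9, freq="5min").append(
    pd.date_range("2018-02-01", periods=9, freq="5min")
)
index.name = "Datetime"

# Values transcribed from the sample table in the question.
df = pd.DataFrame(
    {
        "X": [1, nan, 2, 3, 2, nan, nan, nan, 4, nan, 2, 2, 2, 2, nan, nan, 3, nan],
        "Y": [1, 2, nan, 4, 2, 1, nan, nan, 4, nan, 3, 2, 5, 2, nan, 1, nan, nan],
    },
    index=index,
)
print(df)
```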

First create groups of consecutive missing values for each column, each group filled with a unique value:

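# True where a value is missing
df1 = df.isna()
# a new id starts whenever the missing/non-missing state changes within a day;
# ids are kept only for the missing cells
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
# scale the Y ids so that they cannot collide with the X ids
df2['Y'] *= len(df2)
print (df2)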
                        X      Y
Datetime                        
2018-01-01 00:00:00   NaN    NaN
2018-01-01 00:05:00   2.0    NaN
2018-01-01 00:10:00   NaN   36.0
2018-01-01 00:15:00   NaN    NaN
2018-01-01 00:20:00   NaN    NaN
2018-01-01 00:25:00   4.0    NaN
2018-01-01 00:30:00   4.0   72.0
2018-01-01 00:35:00   4.0   72.0
2018-01-01 00:40:00   NaN    NaN
2018-02-01 00:00:00   6.0  108.0
2018-02-01 00:05:00   NaN    NaN
2018-02-01 00:10:00   NaN    NaN
2018-02-01 00:15:00   NaN    NaN
2018-02-01 00:20:00   NaN    NaN
2018-02-01 00:25:00   8.0  144.0
2018-02-01 00:30:00   8.0    NaN
2018-02-01 00:35:00   NaN  180.0
2018-02-01 00:40:00  10.0  180.0
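The labelling step above is easier to follow on a single boolean Series. The sketch below uses a made-up mask `s` (not data from the question) and the same `ne(shift()).cumsum().where(...)` chain: a new id starts wherever the value differs from the previous row, and only the `True` cells keep their id.

```python
import pandas as pd

# Hypothetical "is missing" mask: runs of True with lengths 2 and 3.
s = pd.Series([False, True, True, False, True, True, True, False])

# True wherever the value differs from the previous row (the first row always does).
starts = s.ne(s.shift())

# The cumulative sum turns each start into a new run id.
run_id = starts.cumsum()

# Keep the ids only where the mask is True; other cells become NaN.
labels = run_id.where(s)

# Number of cells per run id, largest first.
run_lengths = labels.value_counts()
print(labels.tolist())
print(run_lengths)
```

Here the run with id 4 has three cells and the run with id 2 has two, so `run_lengths.index[0]` picks id 4.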
Then get the group with the largest count, here group 4:

a = df2.stack().value_counts().index[0]
print (a)
4.0
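Note that `value_counts().index[0]` returns a single winner for the whole frame. If the longest run is wanted separately for each day in one column, as the wording of the question suggests, one possible variant is sketched below. It is an assumption about the intent rather than part of the answer; the Series `x` is made-up data, and when two runs tie within a day both are selected.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column data covering two days.
idx = pd.date_range("2018-01-01", periods=4, freq="5min").append(
    pd.date_range("2018-01-02", periods=4, freq="5min")
)
x = pd.Series([np.nan, 1, np.nan, np.nan, np.nan, np.nan, np.nan, 2], index=idx)

isna = x.isna()
day = pd.Series(x.index.normalize(), index=x.index)

# A new run starts when the missing state changes or when the day changes.
run_id = (isna.ne(isna.shift()) | day.ne(day.shift())).cumsum().where(isna)

# Length of the run each missing cell belongs to (NaN for non-missing cells).
sizes = run_id.map(run_id.value_counts())

# Longest run length observed within each day, broadcast back to every row.
longest_per_day = sizes.groupby(day).transform("max")

# Rows that belong to the longest run of their own day.
mask = isna & sizes.eq(longest_per_day)
print(mask.astype(int).tolist())
```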
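Get a mask of the matching rows, use it to set 0 in both columns, and create the Flag column by converting the mask to integers, which maps True/False to 1/0: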

mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)

print (df)
                       X    Y  Flag
Datetime                           
2018-01-01 00:00:00  1.0  1.0     0
2018-01-01 00:05:00  NaN  2.0     0
2018-01-01 00:10:00  2.0  NaN     0
2018-01-01 00:15:00  3.0  4.0     0
2018-01-01 00:20:00  2.0  2.0     0
2018-01-01 00:25:00  0.0  0.0     1
2018-01-01 00:30:00  0.0  0.0     1
2018-01-01 00:35:00  0.0  0.0     1
2018-01-01 00:40:00  4.0  4.0     0
2018-02-01 00:00:00  NaN  NaN     0
2018-02-01 00:05:00  2.0  3.0     0
2018-02-01 00:10:00  2.0  2.0     0
2018-02-01 00:15:00  2.0  5.0     0
2018-02-01 00:20:00  2.0  2.0     0
2018-02-01 00:25:00  NaN  NaN     0
2018-02-01 00:30:00  NaN  1.0     0
2018-02-01 00:35:00  3.0  NaN     0
2018-02-01 00:40:00  NaN  NaN     0

Do you need to find the maximum for each day, or one maximum across all days, here 3?

@Jezrael, actually I want to find the maximum for some filtered dates. In this case the date is 1 January 2018, but I may have to use a list of dates. This question is an extension of one I asked earlier. Here is the link to my previous question.

@Jezrael, great. If I have to use a list of dates, then I need to mask the longest stretch of nulls and replace it with zeros only for the dates in the list, not for all dates. This is an extension of the question you just answered. Here is the link.

@Jazreal, great! This works perfectly fine for me. Thank you very much for your effort.

EDIT:

Added a new condition that matches only the dates in a list:

dates = df.index.floor('d')

filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]

df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
                       X     Y
Datetime                      
2018-01-01 00:00:00  NaN   NaN
2018-01-01 00:05:00  2.0   NaN
2018-01-01 00:10:00  NaN  36.0
2018-01-01 00:15:00  NaN   NaN
2018-01-01 00:20:00  NaN   NaN
2018-01-01 00:25:00  4.0   NaN
2018-01-01 00:30:00  4.0  72.0
2018-01-01 00:35:00  4.0  72.0
2018-01-01 00:40:00  NaN   NaN
2018-02-01 00:00:00  NaN   NaN
2018-02-01 00:05:00  NaN   NaN
2018-02-01 00:10:00  NaN   NaN
2018-02-01 00:15:00  NaN   NaN
2018-02-01 00:20:00  NaN   NaN
2018-02-01 00:25:00  NaN   NaN
2018-02-01 00:30:00  NaN   NaN
2018-02-01 00:35:00  NaN   NaN
2018-02-01 00:40:00  NaN   NaN
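The `m[:, None]` step in the date filter relies on broadcasting: the one-dimensional row mask is reshaped into a single column so that it applies to every column of `df.isna()`. A small sketch with made-up data; it passes `pd.to_datetime(...)` to `isin` instead of plain strings, which is a stylistic choice of this example and not something the answer requires.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame spanning two days.
df = pd.DataFrame(
    {"X": [np.nan, 1.0, np.nan], "Y": [np.nan, np.nan, 2.0]},
    index=pd.to_datetime(
        ["2018-01-01 00:00", "2018-01-01 00:05", "2018-02-01 00:00"]
    ),
)

# Boolean array with one entry per row: True if the row's day is in the list.
m = df.index.floor("d").isin(pd.to_datetime(["2018-01-01"]))
print(m.shape, m[:, None].shape)  # (3,) (3, 1)

# The (3, 1) column is broadcast across both columns, so missing cells on
# days outside the list are no longer counted as missing.
restricted = df.isna() & m[:, None]
print(restricted)
```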

a = df2.stack().value_counts().index[0]
# alternative that also works when the filtered rows contain no NaNs
# (avoids IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)

mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
                       X    Y  Flag
Datetime                           
2018-01-01 00:00:00  1.0  1.0     0
2018-01-01 00:05:00  NaN  2.0     0
2018-01-01 00:10:00  2.0  NaN     0
2018-01-01 00:15:00  3.0  4.0     0
2018-01-01 00:20:00  2.0  2.0     0
2018-01-01 00:25:00  0.0  0.0     1
2018-01-01 00:30:00  0.0  0.0     1
2018-01-01 00:35:00  0.0  0.0     1
2018-01-01 00:40:00  4.0  4.0     0
2018-02-01 00:00:00  NaN  NaN     0
2018-02-01 00:05:00  2.0  3.0     0
2018-02-01 00:10:00  2.0  2.0     0
2018-02-01 00:15:00  2.0  5.0     0
2018-02-01 00:20:00  2.0  2.0     0
2018-02-01 00:25:00  NaN  NaN     0
2018-02-01 00:30:00  NaN  1.0     0
2018-02-01 00:35:00  3.0  NaN     0
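Putting the steps together, the whole approach could be wrapped in one function along these lines. This is a sketch under stated assumptions, not the answerer's code: the name `flag_longest_null_run` and its `dates` parameter are hypothetical, per-column offsets replace the `df2['Y'] *= len(df2)` trick so that any number of columns works, and the run ids are counted with a flattened array instead of `stack()`. As in the answer, a single winning run is chosen across all columns and days, and ties are resolved by whatever `value_counts` lists first.

```python
import numpy as np
import pandas as pd


def flag_longest_null_run(df, dates=None):
    """Zero out the rows of the longest NaN run and add a 0/1 'Flag' column.

    `dates` is an optional list of days; if given, only NaNs on those days count.
    """
    out = df.copy()
    days = out.index.floor("d")
    isna = out.isna()
    if dates is not None:
        keep = days.isin(pd.to_datetime(list(dates)))
        isna = isna & keep[:, None]

    # Label consecutive NaN runs per column, restarting the comparison each day.
    run_id = isna.ne(isna.groupby(days).shift()).cumsum().where(isna)
    # Shift each column's ids into its own range so ids are unique frame-wide.
    run_id = run_id + np.arange(run_id.shape[1]) * len(run_id)

    # Count cells per run id; NaN cells are ignored by value_counts.
    counts = pd.Series(run_id.to_numpy().ravel()).value_counts()
    if counts.empty:
        out["Flag"] = 0
        return out

    mask = run_id.eq(counts.index[0]).any(axis=1)
    out.loc[mask, :] = 0
    out["Flag"] = mask.astype(int)
    return out


nan = np.nan
index = pd.date_range("2018-01-01", periods=9, freq="5min").append(
    pd.date_range("2018-02-01", periods=9, freq="5min")
)
df = pd.DataFrame(
    {
        "X": [1, nan, 2, 3, 2, nan, nan, nan, 4, nan, 2, 2, 2, 2, nan, nan, 3, nan],
        "Y": [1, 2, nan, 4, 2, 1, nan, nan, 4, nan, 3, 2, 5, 2, nan, 1, nan, nan],
    },
    index=index,
)
print(flag_longest_null_run(df, dates=["2018-01-01", "2019-01-01"]))
```

Working on a copy keeps the caller's frame unchanged, which makes it easy to compare the result for different date lists.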