Python: expanding a time series held as numpy.array(), pandas.DataFrame() or xarray.Dataset() so that missing records are included as NaN

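In the example below, the time series data ds (or df) has 30 randomly selected records missing, and those records are absent altogether rather than present as NaN. The data length is therefore 365x5-30 instead of 365x5.

import numpy as np
import pandas as pd
import xarray as xr

# Mark 30 randomly drawn positions as missing (randint may repeat an index).
validIdx = np.ones(365 * 5, dtype=bool)
validIdx[np.random.randint(low=0, high=365 * 5, size=30)] = False
# Hourly timestamps (freq="H"), keeping only the valid positions.
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()

My question is: how can I expand ds and df so that the 30 missing values are included as NaN (so that the length becomes 365x5)? For example, if the value for "2000-12-02" is missing from the example data, it looks like this, whereas I want a row for 2000-12-02 whose value is NaN:

...
2000-12-01  value 1
2000-12-03  value 2
...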
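
One caveat about the setup code: np.random.randint draws with replacement, so the same position can be picked twice and the mask may then remove fewer than 30 records. Below is a minimal sketch of a variant that always removes exactly 30, assuming that is the intent. The names n, rng and missing_idx are introduced here for illustration only, the seed is arbitrary, and the lowercase "h" alias is used because recent pandas releases deprecate "H".

```python
import numpy as np
import pandas as pd

n = 365 * 5
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Draw 30 distinct positions; replace=False rules out repeated indices.
missing_idx = rng.choice(n, size=30, replace=False)

validIdx = np.ones(n, dtype=bool)
validIdx[missing_idx] = False

# Same construction as in the question, restricted to pandas.
time = pd.date_range("2000-01-01", freq="h", periods=n)[validIdx]
data = np.arange(n)[validIdx]
df = pd.DataFrame({"foo": data}, index=pd.Index(time, name="time"))

print(len(df))  # always n - 30 = 1795
```

With randint, by contrast, the resulting length can be anywhere from n - 30 up to n - 1, depending on how many draws collide.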

Maybe you could try resampling at a 1-hour frequency.

df with NaN (df_1h):

>>> df_1h = df.resample('1H').mean()
>>> df_1h
                        foo
time
2000-01-01 00:00:00     0.0
2000-01-01 01:00:00     1.0
2000-01-01 02:00:00     2.0
2000-01-01 03:00:00     3.0
2000-01-01 04:00:00     4.0
...                     ...
2000-03-16 20:00:00  1820.0
2000-03-16 21:00:00  1821.0
2000-03-16 22:00:00  1822.0
2000-03-16 23:00:00  1823.0
2000-03-17 00:00:00  1824.0

[1825 rows x 1 columns]

For comparison, df without NaN (right after df = ds.to_dataframe()):

>>> df
                      foo
time
2000-01-01 00:00:00     0
2000-01-01 01:00:00     1
2000-01-01 02:00:00     2
2000-01-01 03:00:00     3
2000-01-01 04:00:00     4
...                   ...
2000-03-16 20:00:00  1820
2000-03-16 21:00:00  1821
2000-03-16 22:00:00  1822
2000-03-16 23:00:00  1823
2000-03-17 00:00:00  1824

[1795 rows x 1 columns]

Rows with NaN:

>>> df_1h[df_1h['foo'].isna()]
                     foo
time
2000-01-02 10:00:00  NaN
2000-01-04 07:00:00  NaN
2000-01-05 06:00:00  NaN
2000-01-09 02:00:00  NaN
2000-01-13 15:00:00  NaN
2000-01-16 16:00:00  NaN
2000-01-18 21:00:00  NaN
2000-01-21 22:00:00  NaN
2000-01-23 19:00:00  NaN
2000-01-24 01:00:00  NaN
2000-01-24 19:00:00  NaN
2000-01-27 12:00:00  NaN
2000-01-27 16:00:00  NaN
2000-01-29 06:00:00  NaN
2000-02-02 01:00:00  NaN
2000-02-06 13:00:00  NaN
2000-02-09 11:00:00  NaN
2000-02-15 12:00:00  NaN
2000-02-15 15:00:00  NaN
2000-02-21 04:00:00  NaN
2000-02-28 05:00:00  NaN
2000-02-28 06:00:00  NaN
2000-03-01 15:00:00  NaN
2000-03-02 18:00:00  NaN
2000-03-04 18:00:00  NaN
2000-03-05 20:00:00  NaN
2000-03-12 08:00:00  NaN
2000-03-13 20:00:00  NaN
2000-03-16 01:00:00  NaN

Number of NaNs in df_1h:

>>> df_1h.isnull().sum()
foo    30
dtype: int64

Thank you, it works well, and it also works for xr.Dataset() and the like: ds.resample(time='1H').mean(). One thing I would like to ask: why is the number of rows in your example 1796 rather than 1795, and the number of NaNs 29 rather than 30?

Glad to hear it works well. To avoid confusion, I edited the answer so that the number of NaNs is 30. I noticed that the NaN count of 29 could not be reproduced; it somehow happened while I was writing the answer. In any case, I think that is not the point of your question.

Yes, it is not the point, but thank you for clarifying!
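
As a possible alternative to resample(...).mean(), which aggregates the values in each bin, the gaps can also be exposed by conforming the data to a complete hourly index. The following is a minimal sketch on a small hypothetical frame: the 6-row data and the names full, keep and full_index are assumptions made for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Small stand-in for df: six hourly stamps with positions 2 and 4 dropped.
full = pd.date_range("2000-01-01", freq="h", periods=6)
keep = np.array([True, True, False, True, False, True])
df = pd.DataFrame(
    {"foo": np.arange(6)[keep]},
    index=pd.Index(full[keep], name="time"),
)

# Option 1: asfreq inserts every missing stamp between the first and
# last index value; the new rows hold NaN.
df_asfreq = df.asfreq("h")

# Option 2: reindex against an explicitly built index. Because the target
# range is stated up front, this would also restore records missing at
# the very start or end of the series.
full_index = pd.date_range("2000-01-01", freq="h", periods=6, name="time")
df_reindexed = df.reindex(full_index)

print(df_asfreq["foo"].isna().sum())     # 2
print(df_reindexed["foo"].isna().sum())  # 2
```

Neither call averages anything, so existing values pass through unchanged apart from the cast to float that NaN requires. For an xarray object, ds.reindex(time=full_index) should serve the same purpose, given a full_index that covers the intended range.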