Python: Expand a time series in the form of numpy.array(), pandas.DataFrame(), or xarray.Dataset() to include missing records as NaN


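```python
import numpy as np
import pandas as pd
import xarray as xr

validIdx = np.ones(365 * 5, dtype=bool)
validIdx[np.random.randint(low=0, high=365 * 5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
```

In the above example, the time series data `ds` (or `df`) has 30 randomly chosen missing records, and those records are simply absent rather than present as NaN. The data length is therefore 365x5-30 instead of 365x5.

My question is: how can I expand `ds` and `df` so that the 30 missing values are included as NaN (so that the length becomes 365x5)? For example, if the value for "2000-12-02" is missing from the example data, it looks like this:

```
...
2000-12-01 value 1
2000-12-03 value 2
...
```

whereas what I want is:

```
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
```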
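Before filling anything in, it can help to list which timestamps are actually absent. The following is only a minimal sketch: it assumes a regular hourly grid like the one in the question, and the 48-sample length and the gap positions are made up for illustration.

```python
import numpy as np
import pandas as pd

n = 48  # two days of hourly data, a small stand-in for the 365 * 5 samples
full_index = pd.date_range("2000-01-01", periods=n, freq="h", name="time")

keep = np.ones(n, dtype=bool)
keep[[5, 17, 30]] = False  # hypothetical gap positions
df = pd.DataFrame({"foo": np.arange(n)[keep]}, index=full_index[keep])

# Timestamps that exist on the complete grid but not in the data.
missing = full_index.difference(df.index)
print(missing)
```

`missing` is a `DatetimeIndex`, so its length gives the number of records that would have to become NaN.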

Maybe you can try resampling at a 1-hour frequency.

`df` without NaN (right after `df = ds.to_dataframe()`):

```python
>>> df
                      foo
time
2000-01-01 00:00:00     0
2000-01-01 01:00:00     1
2000-01-01 02:00:00     2
2000-01-01 03:00:00     3
2000-01-01 04:00:00     4
...                   ...
2000-03-16 20:00:00  1820
2000-03-16 21:00:00  1821
2000-03-16 22:00:00  1822
2000-03-16 23:00:00  1823
2000-03-17 00:00:00  1824

[1795 rows x 1 columns]
```

`df` with NaN (`df_1h`):

```python
>>> df_1h = df.resample('1H').mean()
>>> df_1h
                        foo
time
2000-01-01 00:00:00     0.0
2000-01-01 01:00:00     1.0
2000-01-01 02:00:00     2.0
2000-01-01 03:00:00     3.0
2000-01-01 04:00:00     4.0
...                     ...
2000-03-16 20:00:00  1820.0
2000-03-16 21:00:00  1821.0
2000-03-16 22:00:00  1822.0
2000-03-16 23:00:00  1823.0
2000-03-17 00:00:00  1824.0

[1825 rows x 1 columns]
```

Rows with NaN:

```python
>>> df_1h[df_1h['foo'].isna()]
                     foo
time
2000-01-02 10:00:00  NaN
2000-01-04 07:00:00  NaN
2000-01-05 06:00:00  NaN
2000-01-09 02:00:00  NaN
2000-01-13 15:00:00  NaN
2000-01-16 16:00:00  NaN
2000-01-18 21:00:00  NaN
2000-01-21 22:00:00  NaN
2000-01-23 19:00:00  NaN
2000-01-24 01:00:00  NaN
2000-01-24 19:00:00  NaN
2000-01-27 12:00:00  NaN
2000-01-27 16:00:00  NaN
2000-01-29 06:00:00  NaN
2000-02-02 01:00:00  NaN
2000-02-06 13:00:00  NaN
2000-02-09 11:00:00  NaN
2000-02-15 12:00:00  NaN
2000-02-15 15:00:00  NaN
2000-02-21 04:00:00  NaN
2000-02-28 05:00:00  NaN
2000-02-28 06:00:00  NaN
2000-03-01 15:00:00  NaN
2000-03-02 18:00:00  NaN
2000-03-04 18:00:00  NaN
2000-03-05 20:00:00  NaN
2000-03-12 08:00:00  NaN
2000-03-13 20:00:00  NaN
2000-03-16 01:00:00  NaN
```
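Resampling with `.mean()` aggregates each bin. If the goal is only to put the existing samples on a complete grid, `asfreq` or `reindex` are possible alternatives. This is a sketch rather than part of the answer above: the gap positions are hypothetical, and the frequency is written as lowercase `"h"`, which newer pandas releases prefer over `"H"`.

```python
import numpy as np
import pandas as pd

n = 365 * 5
full_index = pd.date_range("2000-01-01", periods=n, freq="h", name="time")

keep = np.ones(n, dtype=bool)
keep[[34, 55, 102]] = False  # hypothetical gaps; first and last samples are kept
df = pd.DataFrame({"foo": np.arange(n)[keep]}, index=full_index[keep])

# Option 1: conform to a fixed frequency between the first and last timestamp.
df_asfreq = df.asfreq("h")

# Option 2: reindex against an explicit grid, which would also restore gaps
# at the very start or end of the period.
df_reindexed = df.reindex(full_index)

print(len(df), len(df_asfreq), len(df_reindexed))
print(int(df_reindexed["foo"].isna().sum()))
```

Both results have `n` rows, and the three dropped positions come back as NaN. `reindex` needs the full grid to be known in advance, while `asfreq` infers its range from the data, so it cannot recover a record missing at either end.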

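Thank you, it works great, and it also works for `xr.Dataset()` and so on: `ds.resample(time='1H').mean()`. One question I would like to ask: why is the number of rows in your example 1796 rather than 1795, and the number of NaNs 29 rather than 30?

Glad to hear it works well. To avoid confusion, I edited the answer so that the number of NaNs is 30. I noticed that the NaN count of 29 could not be reproduced; it somehow happened while I was writing the answer. In any case, I don't think that is the point of your question.

Yes, that is not the point, but thank you for clarifying!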
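One plausible explanation for a run with 1796 rows and 29 NaNs (this is an inference, not something the thread confirms) is that `np.random.randint` samples with replacement, so the same index can be drawn twice and fewer than 30 distinct records get masked. A small sketch of the difference, using NumPy's `Generator` API instead of the legacy functions from the question:

```python
import numpy as np

n = 365 * 5
rng = np.random.default_rng()

# Sampling with replacement (the behaviour of np.random.randint) can repeat
# an index, so fewer than 30 distinct positions may end up masked.
with_repeats = rng.integers(low=0, high=n, size=30)
print(len(np.unique(with_repeats)))  # at most 30, occasionally fewer

# Sampling without replacement always yields 30 distinct positions.
distinct = rng.choice(n, size=30, replace=False)
valid_idx = np.ones(n, dtype=bool)
valid_idx[distinct] = False
print(int((~valid_idx).sum()))
```

With `replace=False` the mask always removes exactly 30 records, so the row count and NaN count are the same on every run.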

Number of NaN values in `df_1h`:

```python
>>> df_1h.isnull().sum()
foo    30
dtype: int64
```
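The same idea can be applied directly to an `xr.Dataset`, as the comment about `ds.resample(time='1H').mean()` suggests. The sketch below assumes three hypothetical gap positions and uses the lowercase `"1h"` alias. It also shows `reindex` as a non-aggregating alternative.

```python
import numpy as np
import pandas as pd
import xarray as xr

n = 365 * 5
full_time = pd.date_range("2000-01-01", periods=n, freq="h")

keep = np.ones(n, dtype=bool)
keep[[34, 55, 102]] = False  # hypothetical gaps
ds = xr.Dataset(
    {"foo": ("time", np.arange(n)[keep])},
    coords={"time": full_time[keep]},
)

# Resample to hourly bins; bins without any sample come back as NaN.
ds_resampled = ds.resample(time="1h").mean()

# Alternatively, reindex onto the full grid without aggregating.
ds_reindexed = ds.reindex(time=full_time)

print(ds_resampled.sizes["time"], int(ds_resampled["foo"].isnull().sum()))
print(ds_reindexed.sizes["time"], int(ds_reindexed["foo"].isnull().sum()))
```

If a NumPy array is needed afterwards, `ds_reindexed["foo"].values` returns the padded data with NaN in the gaps.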