Python: transforming a DataFrame, need a more efficient solution

python pandas dataframe

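I have a DataFrame indexed by dates over some period. Its columns hold predictions of a variable's value at the end of a given year. My original DataFrame looks like this: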

            2016  2017  2018
2016-01-01   0.0     1   NaN
2016-07-01   1.0     1   4.1
2017-01-01   NaN     5   3.0
2017-07-01   NaN     2   2.0

where NaN means that no prediction exists for the given year.

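Since I am working with a period of more than 20 years and most predictions are for the next 2-3 years, my real DataFrame has more than 20 columns, most of which contain NaN values. For example, the 2005 column holds predictions made in 2003-2005, but for dates in the 2006-2020 range all of its values are NaN.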

I want to transform my DataFrame into something like this:

            Y_0  Y_1  Y_2
2016-01-01    0    1  NaN
2016-07-01    1    1  4.1
2017-01-01    5    3  NaN
2017-07-01    2    2  NaN

where Y_j holds the prediction for year = index.year + j. This way I would have a DataFrame with only 4 columns (Y_0, Y_1, Y_2, Y_3).
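To make the definition of Y_j concrete, here is a small sketch that computes the offset j for every cell of the sample frame. The string column labels and the variable names (target_years, made_in, offsets) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; string column labels are an assumption.
df = pd.DataFrame(
    {"2016": [0.0, 1.0, np.nan, np.nan],
     "2017": [1.0, 1.0, 5.0, 2.0],
     "2018": [np.nan, 4.1, 3.0, 2.0]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)

target_years = df.columns.astype(int).to_numpy()  # year each column predicts
made_in = df.index.year.to_numpy()                # year each row's prediction was made

# offsets[r, c] is the j such that cell (r, c) belongs in column Y_j.
offsets = target_years[None, :] - made_in[:, None]
print(pd.DataFrame(offsets, index=df.index, columns=df.columns))
```

Negative offsets correspond to a column whose year is already in the past for that row; in the sample data those cells are NaN.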

I did manage to do this, but in what I think is a very inefficient way:


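import numpy

for i in range(4):
    df[f'Y_{i}'] = numpy.nan  # create the columns [Y_0, Y_1, Y_2, Y_3]
for index, row in df.iterrows():  # iterate over each row of df
    for year in row.dropna().index:  # iterate over each year for which a prediction exists
        year_diff = int(year) - index.year  # difference between the predicted year and the year the prediction was made (possible values: 0, 1, 2 or 3)
        df.loc[index, f'Y_{year_diff}'] = df.loc[index, year]  # set the values of the Y_0, Y_1, Y_2 and Y_3 columns cell by cell
df = df.iloc[:, -4:]  # drop all columns except the new ones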

For a DataFrame with only 1000 rows, this takes almost 3 seconds to run. Can anyone think of a better solution?

You can use melt to convert the frame to long format, then reshape it based on the year difference.

Taking this DataFrame as an example:

import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({'date':[datetime.date(2016, 1, 1), datetime.date(2016, 7, 1),
                      datetime.date(2017, 1, 1), datetime.date(2017, 7, 1)],
             2016:[0,1,np.nan,np.nan],
             2017:[1,1,5,2],
             2018:[np.nan, 4.1, 3, 2]})
df = df.melt(id_vars = 'date', value_vars = [2016, 2017, 2018], var_name='prediction_year', value_name='prediction')

Long format:

    date        prediction_year prediction
0   2016-01-01  2016    0.0
1   2016-07-01  2016    1.0
2   2017-01-01  2016    NaN
3   2017-07-01  2016    NaN
4   2016-01-01  2017    1.0
5   2016-07-01  2017    1.0
6   2017-01-01  2017    5.0
7   2017-07-01  2017    2.0
8   2016-01-01  2018    NaN
9   2016-07-01  2018    4.1
10  2017-01-01  2018    3.0
11  2017-07-01  2018    2.0

Convert back to the desired wide format:

df['year'] = pd.to_datetime(df['date']).dt.year
df['dt'] = df['prediction_year'] - df['year']
df = df.pivot(index = 'date', columns='dt', values='prediction').dropna(axis = 1, how = 'all').add_prefix('Y_')
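The melt and pivot steps above can be combined into one function that starts from the date-indexed frame in the question. This is only a sketch: the name reshape_with_melt is hypothetical, and it assumes a unique DatetimeIndex and column labels that convert to integer years.

```python
import numpy as np
import pandas as pd

def reshape_with_melt(wide):
    """Hypothetical helper: wide -> long -> wide keyed by year offset."""
    long = (
        wide.rename_axis("date")
            .reset_index()
            .melt(id_vars="date", var_name="prediction_year", value_name="prediction")
            .dropna(subset=["prediction"])  # keep only predictions that exist
            .assign(dt=lambda d: d["prediction_year"].astype(int) - d["date"].dt.year)
    )
    out = long.pivot(index="date", columns="dt", values="prediction").add_prefix("Y_")
    out.columns.name = None
    # Restore dates that had no predictions at all and were dropped above.
    return out.reindex(wide.index)

# Sample data from the question.
df = pd.DataFrame(
    {2016: [0.0, 1.0, np.nan, np.nan],
     2017: [1.0, 1.0, 5.0, 2.0],
     2018: [np.nan, 4.1, 3.0, 2.0]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)
out = reshape_with_melt(df)
print(out)
```

Dropping the NaN rows before pivoting means offsets with no data (such as -1 in the sample) never become columns, so no dropna(axis=1, how='all') is needed afterwards.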

Let's try stacking, then computing the year difference:

# if the index is not already datetime
df.index = pd.to_datetime(df.index)

df = (df.stack().reset_index()
   .assign(date_diff=lambda x: x['level_1'].astype(int) - x['level_0'].dt.year)
   .pivot(index='level_0', columns='date_diff', values=0)
   .add_prefix('Y_')
)

Output:

date_diff   Y_0  Y_1  Y_2
level_0                  
2016-01-01  0.0  1.0  NaN
2016-07-01  1.0  1.0  4.1
2017-01-01  5.0  3.0  NaN
2017-07-01  2.0  2.0  NaN
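The output above relies on stack discarding the NaN cells. Whether it does so by default may differ across pandas versions, so the following sketch drops them explicitly. The helper name reshape_with_stack is hypothetical, and the string column labels are an assumption.

```python
import numpy as np
import pandas as pd

def reshape_with_stack(wide):
    """Hypothetical helper: stack-based reshape with an explicit NaN drop."""
    long = (
        wide.stack()
            .dropna()  # do not depend on stack's own NaN handling
            .rename("prediction")
            .rename_axis(["date", "prediction_year"])
            .reset_index()
            .assign(dt=lambda d: d["prediction_year"].astype(int) - d["date"].dt.year)
    )
    out = long.pivot(index="date", columns="dt", values="prediction").add_prefix("Y_")
    out.columns.name = None
    # Bring back any date whose row was entirely NaN.
    return out.reindex(wide.index)

# Sample data from the question, with string column labels.
df = pd.DataFrame(
    {"2016": [0.0, 1.0, np.nan, np.nan],
     "2017": [1.0, 1.0, 5.0, 2.0],
     "2018": [np.nan, 4.1, 3.0, 2.0]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)
result = reshape_with_stack(df)
print(result)
```

On the sample data this should give the same Y_0, Y_1 and Y_2 columns as shown above, without a spurious Y_-1 column.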