Adding random data between rows in a DataFrame (Python 3.x, pandas)


I have a pandas DataFrame like this. It contains a timestamp, id, foo, and bar. The timestamp data arrives roughly every 10 minutes.

timestamp            id  foo  bar
2019-04-14 00:00:10  1   0.10 0.05
2019-04-14 00:10:02  1   0.30 0.10
2019-04-14 00:00:00  2   0.10 0.05
2019-04-14 00:10:00  2   0.30 0.10
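
For anyone who wants to reproduce the question, the sample table can be built as a DataFrame roughly as follows. This is a small sketch; the only assumption is that the timestamp column should already be parsed as datetimes rather than kept as strings.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-04-14 00:00:10",
        "2019-04-14 00:10:02",
        "2019-04-14 00:00:00",
        "2019-04-14 00:10:00",
    ]),
    "id": [1, 1, 2, 2],
    "foo": [0.10, 0.30, 0.10, 0.30],
    "bar": [0.05, 0.10, 0.05, 0.10],
})
print(df)
```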

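For each id, I want to create 5 additional rows, with the timestamp split evenly between consecutive rows and with foo and bar values that are random values lying between those of the consecutive rows.

The start time for each id should be its earliest timestamp, and the end time for each id should be its latest timestamp.

So the output would look like this: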

timestamp            id  foo  bar
2019-04-14 00:00:10  1   0.10 0.05
2019-04-14 00:02:10  1   0.14 0.06
2019-04-14 00:04:10  1   0.11 0.06
2019-04-14 00:06:10  1   0.29 0.07
2019-04-14 00:08:10  1   0.22 0.09
2019-04-14 00:10:02  1   0.30 0.10
2019-04-14 00:00:00  2   0.80 0.50
2019-04-14 00:02:00  2   0.45 0.48
2019-04-14 00:04:00  2   0.52 0.42
2019-04-14 00:06:00  2   0.74 0.48
2019-04-14 00:08:00  2   0.41 0.45
2019-04-14 00:10:00  2   0.40 0.40

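I can reindex the timestamp column and create the additional timestamp rows, but I can't seem to work out how to compute random values for foo and bar between consecutive rows.

I would appreciate it if someone could point me in the right direction.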

You are close. What you need is date_range per group, using the first and last values of the DatetimeIndex:
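import numpy as np
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
# Reindex each id onto 6 evenly spaced timestamps between its first and last one
df = (df.set_index('timestamp')
        .groupby('id')[['foo', 'bar']]
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))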

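Then create a helper DataFrame of random values with the same shape as the reindexed one, and use it to fill the missing values:
df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(df1)
print(df)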
                                 foo       bar
id                                            
1  2019-04-14 00:00:10.000  0.100000  0.050000
   2019-04-14 00:02:08.400  0.903435  0.755841
   2019-04-14 00:04:06.800  0.956002  0.253878
   2019-04-14 00:06:05.200  0.388454  0.257639
   2019-04-14 00:08:03.600  0.225535  0.195306
   2019-04-14 00:10:02.000  0.300000  0.100000
2  2019-04-14 00:00:00.000  0.100000  0.050000
   2019-04-14 00:02:00.000  0.180865  0.327581
   2019-04-14 00:04:00.000  0.417956  0.414400
   2019-04-14 00:06:00.000  0.012686  0.800948
   2019-04-14 00:08:00.000  0.716216  0.941396
   2019-04-14 00:10:00.000  0.300000  0.100000
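
The random values above are drawn from [0, 1), so they are not tied to the neighbouring rows. Below is a minimal sketch of one way to bound them, assuming the two-rows-per-id sample data from the question. Forward-filling and back-filling within each id give the nearest known value on either side of a gap, and the random draw is then scaled into that range. The names `parts`, `prev`, `nxt`, `lo`, `hi` and the fixed seed are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-04-14 00:00:10", "2019-04-14 00:10:02",
        "2019-04-14 00:00:00", "2019-04-14 00:10:00",
    ]),
    "id": [1, 1, 2, 2],
    "foo": [0.10, 0.30, 0.10, 0.30],
    "bar": [0.05, 0.10, 0.05, 0.10],
})

# Same idea as the reindexing step: 6 evenly spaced timestamps per id
parts = {
    key: g.set_index("timestamp")[["foo", "bar"]].reindex(
        pd.date_range(g["timestamp"].min(), g["timestamp"].max(), periods=6)
    )
    for key, g in df.groupby("id")
}
out = pd.concat(parts, names=["id", "timestamp"])

# Nearest known value before and after each row, computed within each id
prev = out.groupby(level="id").ffill()
nxt = out.groupby(level="id").bfill()
lo = np.minimum(prev, nxt)
hi = np.maximum(prev, nxt)

# Scale uniform draws from [0, 1) into [lo, hi) and fill only the gaps
rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable
rand = lo + (hi - lo) * rng.random(out.shape)
out = out.fillna(rand)
print(out)
```

Because `fillna` only touches missing cells, the original first and last rows of each id keep their values.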

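If the "randomness" is not that important, we can use linear interpolation as follows, which keeps the values between the min and max of each group: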
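# Build 6 evenly spaced timestamps per id (the reindexed rows are all NaN at this point)
df_new = pd.concat([
    d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
    for _, d in df.groupby('id')
])
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)

# Bring back the original rows where the timestamps match, then carry each id forward
df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')
df_new['id'] = df_new['id'].ffill()

# Linearly interpolate the missing foo and bar values
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].apply(lambda x: x.interpolate())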


This gives the following output:
print(df_new)
                 timestamp   id   foo   bar
0  2019-04-14 00:00:10.000  1.0  0.10  0.05
1  2019-04-14 00:02:08.400  1.0  0.14  0.06
2  2019-04-14 00:04:06.800  1.0  0.18  0.07
3  2019-04-14 00:06:05.200  1.0  0.22  0.08
4  2019-04-14 00:08:03.600  1.0  0.26  0.09
5  2019-04-14 00:10:02.000  1.0  0.30  0.10
6  2019-04-14 00:00:00.000  2.0  0.10  0.05
7  2019-04-14 00:02:00.000  2.0  0.14  0.06
8  2019-04-14 00:04:00.000  2.0  0.18  0.07
9  2019-04-14 00:06:00.000  2.0  0.22  0.08
10 2019-04-14 00:08:00.000  2.0  0.26  0.09
11 2019-04-14 00:10:00.000  2.0  0.30  0.10

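Hey @erfan, thank you! The randomness is not important. However, the min and max values should be based on consecutive rows, whereas your solution provides random data based on the min and max of the whole group.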
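
For groups with more than two original rows, one possible way to respect the "consecutive rows" requirement is to walk over each pair of neighbouring rows and generate the new rows for that pair only. The following is a rough sketch under that assumption, not a method given in the thread. The function name `fill_between_rows` and the parameters `n_new` and `seed` are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_between_rows(df, n_new=4, seed=None):
    """Insert n_new rows between every pair of consecutive rows of each id."""
    rng = np.random.default_rng(seed)
    pieces = []
    for key, g in df.sort_values("timestamp").groupby("id"):
        rows = g.to_dict("records")
        for a, b in zip(rows[:-1], rows[1:]):
            # n_new + 2 evenly spaced timestamps from a to b; drop b so it is not duplicated
            ts = pd.date_range(a["timestamp"], b["timestamp"], periods=n_new + 2)[:-1]
            piece = pd.DataFrame({"timestamp": ts, "id": key})
            for col in ("foo", "bar"):
                lo, hi = sorted((a[col], b[col]))
                vals = rng.uniform(lo, hi, size=len(ts))
                vals[0] = a[col]  # the first row of the pair keeps its original value
                piece[col] = vals
            pieces.append(piece)
        pieces.append(pd.DataFrame([rows[-1]]))  # last original row of the group
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-04-14 00:00:10", "2019-04-14 00:10:02",
        "2019-04-14 00:00:00", "2019-04-14 00:10:00",
    ]),
    "id": [1, 1, 2, 2],
    "foo": [0.10, 0.30, 0.10, 0.30],
    "bar": [0.05, 0.10, 0.05, 0.10],
})
out = fill_between_rows(df, n_new=4, seed=0)
print(out)
```

With `n_new=4` this yields six rows per id for the sample data, matching the shape of the desired output in the question. The random values differ from run to run unless a seed is passed.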