Python: fill missing values in a DataFrame only (pandas)

python pandas

What I have in the DataFrame:

email    user_name    sessions    ymo
a@a.com    JD    1    2015-03-01
a@a.com    JD    2    2015-05-01

What I need:

email    user_name    sessions    ymo
a@a.com    JD    0    2015-01-01
a@a.com    JD    0    2015-02-01
a@a.com    JD    1    2015-03-01
a@a.com    JD    0    2015-04-01
a@a.com    JD    2    2015-05-01
a@a.com    JD    0    2015-06-01
a@a.com    JD    0    2015-07-01
a@a.com    JD    0    2015-08-01
a@a.com    JD    0    2015-09-01
a@a.com    JD    0    2015-10-01
a@a.com    JD    0    2015-11-01
a@a.com    JD    0    2015-12-01

The ymo column holds pd.Timestamp values:

all_ymo

[Timestamp('2015-01-01 00:00:00'),
 Timestamp('2015-02-01 00:00:00'),
 Timestamp('2015-03-01 00:00:00'),
 Timestamp('2015-04-01 00:00:00'),
 Timestamp('2015-05-01 00:00:00'),
 Timestamp('2015-06-01 00:00:00'),
 Timestamp('2015-07-01 00:00:00'),
 Timestamp('2015-08-01 00:00:00'),
 Timestamp('2015-09-01 00:00:00'),
 Timestamp('2015-10-01 00:00:00'),
 Timestamp('2015-11-01 00:00:00'),
 Timestamp('2015-12-01 00:00:00')]

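Unfortunately, this answer is not suitable, because it creates duplicates for the existing ymo values.

I tried something like this, but it is extremely slow: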

Generate the month-start dates and reindex, then ffill and bfill the columns ['email', 'user_name'] and fillna(0) the column 'sessions'.
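The reindex-and-fill steps described above can be sketched as follows. This is a minimal illustration, not the answerer's original code: the sample frame is rebuilt from the question, the name `out` is made up for the example, and it assumes a single user. With several users sharing the same dates, `set_index('ymo')` would produce duplicate index labels and `reindex` would raise a ValueError.

```python
import pandas as pd

# Sample data rebuilt from the question (single user only; an assumption).
df = pd.DataFrame({
    "email": ["a@a.com", "a@a.com"],
    "user_name": ["JD", "JD"],
    "sessions": [1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01"]),
})

# Month-start dates for 2015 ("MS" = month start).
all_ymo = pd.date_range("2015-01-01", periods=12, freq="MS")

# Reindex onto every month; rows for missing months are NaN.
out = df.set_index("ymo").reindex(all_ymo)

# Propagate the user columns in both directions, zero-fill the counts.
out[["email", "user_name"]] = out[["email", "user_name"]].ffill().bfill()
out["sessions"] = out["sessions"].fillna(0).astype(int)

# Turn the date index back into a regular ymo column.
out = out.rename_axis("ymo").reset_index()
```

Because `reindex` introduces NaN, `sessions` temporarily becomes a float column, which is why the sketch casts it back to int after filling.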

I tried to create a more general solution using periods:

print (df)
     email user_name  sessions        ymo
0  a@a.com        JD         1 2015-03-01
1  a@a.com        JD         2 2015-05-01
2  b@b.com        AB         1 2015-03-01
3  b@b.com        AB         2 2015-05-01


mbeg = pd.period_range('2015-01', periods=12, freq='M')
print (mbeg)
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06',
             '2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'],
            dtype='int64', freq='M')
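# convert the ymo column to monthly periods
df.ymo = df.ymo.dt.to_period('M')
# group by user, then reindex each group onto all months, filling gaps with 0
# (the chain is wrapped in parentheses so it parses as a single expression)
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          # errors='ignore': newer pandas may already omit the grouping columns
                          .drop(columns=['email','user_name'], errors='ignore'))
        .rename_axis(['email','user_name','ymo'])
        .reset_index())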
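As a small check of the period conversion used in this solution, the sketch below converts two month-start timestamps to monthly periods and back. The sample values are taken from the question; the names `periods` and `restored` are made up for the illustration.

```python
import pandas as pd

# Two month-start dates from the question's ymo column.
ymo = pd.Series(pd.to_datetime(["2015-03-01", "2015-05-01"]))

# Timestamp -> monthly Period.
periods = ymo.dt.to_period("M")

# Period -> Timestamp at the start of each period.
restored = periods.dt.to_timestamp()

# The twelve monthly periods of 2015, as in the solution above.
mbeg = pd.period_range("2015-01", periods=12, freq="M")
```

Since the original dates are already the first of each month, the round trip returns the same timestamps.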

Then, if datetimes are needed, convert the ymo column back with df.ymo.dt.to_timestamp().

Solution with datetimes:

print (df)
     email user_name  sessions        ymo
0  a@a.com        JD         1 2015-03-01
1  a@a.com        JD         2 2015-05-01
2  b@b.com        AB         1 2015-03-01
3  b@b.com        AB         2 2015-05-01

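# month-start dates for 2015 ('MS' = month start), same values as month ends minus MonthBegin()
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

# group by user, then reindex each group onto all months, filling gaps with 0
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(columns=['email','user_name'], errors='ignore'))
        .rename_axis(['email','user_name','ymo'])
        .reset_index())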
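For comparison, here is a variant that avoids groupby.apply by reindexing once on a three-level index. This is a different technique from the answer above, offered only as a sketch: the sample frame is rebuilt from the answer's df, and the names `users`, `full_index` and `out` are made up for the example. It assumes each (email, user_name, ymo) combination occurs at most once.

```python
import pandas as pd

# Sample data rebuilt from the answer's df (two users, same dates).
df = pd.DataFrame({
    "email": ["a@a.com", "a@a.com", "b@b.com", "b@b.com"],
    "user_name": ["JD", "JD", "AB", "AB"],
    "sessions": [1, 2, 1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01"] * 2),
})

mbeg = pd.date_range("2015-01-01", periods=12, freq="MS")

# Unique (email, user_name) pairs actually present in the data.
users = df[["email", "user_name"]].drop_duplicates()

# Every (email, user_name, month) combination for those users.
full_index = pd.MultiIndex.from_tuples(
    [(e, u, m) for e, u in users.itertuples(index=False) for m in mbeg],
    names=["email", "user_name", "ymo"],
)

# One reindex over the full index; missing combinations get sessions = 0.
out = (df.set_index(["email", "user_name", "ymo"])
         .reindex(full_index, fill_value=0)
         .reset_index())
```

Because the user is part of the index, two users with the same dates no longer collide on a duplicate axis.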

If the missing entries outnumber the filled ones, populate a new DataFrame starting from pd.date_range, then add the sessions values where the dates match. If email address and user name are 1-for-1, consider keeping only one of them in the DataFrame to save memory (if size is an issue).

Unfortunately, this does not work if there are rows for another user with the same dates (as in the df in jezrael's answer); it raises "ValueError: cannot reindex from a duplicate axis".
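The suggestion to start from pd.date_range and then add the sessions values where the dates match could look roughly like this. It is one possible reading of that comment, not the commenter's code: the sample frame and the names `months`, `users`, `grid` and `out` are assumptions, and `how="cross"` needs pandas 1.2 or later.

```python
import pandas as pd

# Sample data with two users sharing the same dates (an assumption).
df = pd.DataFrame({
    "email": ["a@a.com", "a@a.com", "b@b.com", "b@b.com"],
    "user_name": ["JD", "JD", "AB", "AB"],
    "sessions": [1, 2, 1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01"] * 2),
})

# New frame built from the full list of month-start dates.
months = pd.DataFrame({"ymo": pd.date_range("2015-01-01", periods=12, freq="MS")})
users = df[["email", "user_name"]].drop_duplicates()

# One row per user per month.
grid = users.merge(months, how="cross")

# Bring in sessions where user and month match; everything else becomes 0.
out = grid.merge(df, on=["email", "user_name", "ymo"], how="left")
out["sessions"] = out["sessions"].fillna(0).astype(int)
```

Since the join keys include the user columns, rows for different users with the same date do not trigger the duplicate-axis error.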

Output of the period-based solution after converting ymo back to timestamps:

df.ymo = df.ymo.dt.to_timestamp()
print (df)
      email user_name        ymo  sessions
0   a@a.com        JD 2015-01-01         0
1   a@a.com        JD 2015-02-01         0
2   a@a.com        JD 2015-03-01         1
3   a@a.com        JD 2015-04-01         0
4   a@a.com        JD 2015-05-01         2
5   a@a.com        JD 2015-06-01         0
6   a@a.com        JD 2015-07-01         0
7   a@a.com        JD 2015-08-01         0
8   a@a.com        JD 2015-09-01         0
9   a@a.com        JD 2015-10-01         0
10  a@a.com        JD 2015-11-01         0
11  a@a.com        JD 2015-12-01         0
12  b@b.com        AB 2015-01-01         0
13  b@b.com        AB 2015-02-01         0
14  b@b.com        AB 2015-03-01         1
15  b@b.com        AB 2015-04-01         0
16  b@b.com        AB 2015-05-01         2
17  b@b.com        AB 2015-06-01         0
18  b@b.com        AB 2015-07-01         0
19  b@b.com        AB 2015-08-01         0
20  b@b.com        AB 2015-09-01         0
21  b@b.com        AB 2015-10-01         0
22  b@b.com        AB 2015-11-01         0
23  b@b.com        AB 2015-12-01         0

Output of the datetime-based solution:

print (df)
      email user_name        ymo  sessions
0   a@a.com        JD 2015-01-01         0
1   a@a.com        JD 2015-02-01         0
2   a@a.com        JD 2015-03-01         1
3   a@a.com        JD 2015-04-01         0
4   a@a.com        JD 2015-05-01         2
5   a@a.com        JD 2015-06-01         0
6   a@a.com        JD 2015-07-01         0
7   a@a.com        JD 2015-08-01         0
8   a@a.com        JD 2015-09-01         0
9   a@a.com        JD 2015-10-01         0
10  a@a.com        JD 2015-11-01         0
11  a@a.com        JD 2015-12-01         0
12  b@b.com        AB 2015-01-01         0
13  b@b.com        AB 2015-02-01         0
14  b@b.com        AB 2015-03-01         1
15  b@b.com        AB 2015-04-01         0
16  b@b.com        AB 2015-05-01         2
17  b@b.com        AB 2015-06-01         0
18  b@b.com        AB 2015-07-01         0
19  b@b.com        AB 2015-08-01         0
20  b@b.com        AB 2015-09-01         0
21  b@b.com        AB 2015-10-01         0
22  b@b.com        AB 2015-11-01         0
23  b@b.com        AB 2015-12-01         0