Python: adding values for missing data combinations in pandas
I have a pandas DataFrame containing the following:
person_id status year count
0 'pass' 1980 4
0 'fail' 1982 1
1 'pass' 1981 2
If I know that all the possible values for each field are:
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
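I would like to fill in the original DataFrame with count=0 for every missing combination of person_id, status, and year, i.e. I would like the new DataFrame to contain: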
person_id status year count
0 'pass' 1980 4
0 'pass' 1981 0
0 'pass' 1982 0
0 'fail' 1980 0
0 'fail' 1981 0
0 'fail' 1982 1
1 'pass' 1980 0
1 'pass' 1981 2
1 'pass' 1982 0
1 'fail' 1980 0
1 'fail' 1981 0
1 'fail' 1982 0
2 'pass' 1980 0
2 'pass' 1981 0
2 'pass' 1982 0
2 'fail' 1980 0
2 'fail' 1981 0
2 'fail' 1982 0
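Is there an efficient way to achieve this in pandas?
You can use itertools.product to generate all the combinations, construct a df from them, then merge it with your original df and use fillna(0) to fill the missing count values with 0: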
In [77]:
import itertools
import pandas as pd
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
combined = [all_person_ids, all_statuses, all_years]
df1 = pd.DataFrame(columns = ['person_id', 'status', 'year'], data=list(itertools.product(*combined)))
df1
Out[77]:
person_id status year
0 0 pass 1980
1 0 pass 1981
2 0 pass 1982
3 0 fail 1980
4 0 fail 1981
5 0 fail 1982
6 1 pass 1980
7 1 pass 1981
8 1 pass 1982
9 1 fail 1980
10 1 fail 1981
11 1 fail 1982
12 2 pass 1980
13 2 pass 1981
14 2 pass 1982
15 2 fail 1980
16 2 fail 1981
17 2 fail 1982
In [82]:
df1 = df1.merge(df, how='left').fillna(0)
df1
Out[82]:
person_id status year count
0 0 pass 1980 4
1 0 pass 1981 0
2 0 pass 1982 0
3 0 fail 1980 0
4 0 fail 1981 0
5 0 fail 1982 1
6 1 pass 1980 0
7 1 pass 1981 2
8 1 pass 1982 0
9 1 fail 1980 0
10 1 fail 1981 0
11 1 fail 1982 0
12 2 pass 1980 0
13 2 pass 1981 0
14 2 pass 1982 0
15 2 fail 1980 0
16 2 fail 1981 0
17 2 fail 1982 0
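For reference, here is a minimal, self-contained sketch of the merge approach above, using the sample data from the question. The names `keys`, `full`, and `result` are illustrative and not from the original answer. The explicit `on=` argument and the final `astype(int)` are additions made here, on the assumption that an integer `count` column is wanted: the NaN values produced by the left merge turn `count` into a float column.

```python
import itertools

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

all_person_ids = [0, 1, 2]
all_statuses = ["pass", "fail"]
all_years = [1980, 1981, 1982]

# Illustrative name: the columns that identify a combination
keys = ["person_id", "status", "year"]

# One row per (person_id, status, year) combination
full = pd.DataFrame(
    list(itertools.product(all_person_ids, all_statuses, all_years)),
    columns=keys,
)

# A left merge keeps every combination; unmatched rows get NaN in "count"
result = full.merge(df, on=keys, how="left")

# NaN forces "count" to float, so fill with 0 and cast back to int
result["count"] = result["count"].fillna(0).astype(int)

print(result)
```

Naming the key columns in `on=` keeps the merge from silently joining on any other column the two frames might happen to share.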
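Create a MultiIndex with MultiIndex.from_product(), then use set_index(), reindex(), and reset_index().
Works great. Could you roughly explain what each of those steps is doing? (I haven't had to use reindex or reset_index before, but I'll read up on them shortly.)
reindex() aligns the rows to the new index, and fill_value=0 fills the entries that would otherwise be NaN with 0. I think you can keep the MultiIndex, since it lets you select elements quickly. reset_index() converts the index levels back into columns.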
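import io
import pandas as pd
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
# On Python 3, read_csv needs a text buffer here, so use StringIO rather than BytesIO
df = pd.read_csv(io.StringIO("""person_id status year count
0 pass 1980 4
0 fail 1982 1
1 pass 1981 2"""), sep=r"\s+")
names = ["person_id", "status", "year"]
# Every (person_id, status, year) combination, in product order
mind = pd.MultiIndex.from_product(
    [all_person_ids, all_statuses, all_years], names=names)
# Index by the key columns, align to the full index (missing rows get count=0),
# then turn the index levels back into columns
df.set_index(names).reindex(mind, fill_value=0).reset_index()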
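The steps described in the comments can also be run one at a time, which may make it easier to see what each call contributes. This is only a sketch using the question's sample data. The intermediate names `indexed`, `full`, `flat`, `one`, and `person_1` are illustrative and not from the original answer. The last two lines show the kind of lookup that keeping the MultiIndex makes possible.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

names = ["person_id", "status", "year"]
mind = pd.MultiIndex.from_product(
    [[0, 1, 2], ["pass", "fail"], [1980, 1981, 1982]], names=names)

# Step 1: move the key columns into a three-level MultiIndex
indexed = df.set_index(names)

# Step 2: align to the full index; combinations absent from df get count=0
full = indexed.reindex(mind, fill_value=0)

# Step 3 (optional): turn the index levels back into ordinary columns
flat = full.reset_index()

# Keeping the MultiIndex allows direct lookups by key
one = full.loc[(0, "fail", 1982), "count"]  # a single combination
person_1 = full.loc[1]  # all rows for person_id 1, indexed by (status, year)

print(flat)
print(one)
print(person_1)
```

Because `fill_value=0` supplies the missing entries directly, no NaN is ever introduced and `count` stays an integer column throughout.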