Pandas: date columns missing from the DataFrame after using groupby
I created a DataFrame by reading an Excel file:
Project Release Name Cycle Name Cycle Start Date Cycle End Date Exec Date Planned Exec Date Available Test Cases Planned Tested Passed Failed Blocked No Run Tester
B1 Y1 CM1 2/7/2018 2/20/2018 2/6/2018 2/6/2018 2 10 8 8 0 0 0 Tester3
B1 Y1 CM1 2/7/2018 2/20/2018 2/7/2018 2/7/2018 2 13 10 9 1 1 0 Tester3
B1 Y1 CM1 2/7/2018 2/20/2018 2/8/2018 2/8/2018 0 1 1 1 0 0 0 Tester3
B1 Y1 CM1 2/7/2018 2/20/2018 2/9/2018 2/9/2018 0 2 2 2 0 0 0 Tester3
B1 Y1 CM1 2/7/2018 2/20/2018 2/10/2018 2/10/2018 0 2 2 2 0 0 0 Tester3
B1 Y1 CL1 2/7/2018 2/25/2018 2/1/2018 2/1/2018 5 25 20 20 0 0 0 Tester 4
B1 Y1 CL1 2/7/2018 2/25/2018 2/2/2018 2/2/2018 10 30 20 18 2 0 0 Tester 4
B1 Y1 CL1 2/7/2018 2/25/2018 2/3/2018 2/3/2018 0 2 2 0 2 0 0 Tester 4
B1 Y1 CL1 1/17/2018 2/25/2018 2/4/2018 2/4/2018 0 3 3 1 2 0 0 Tester 4
B1 Y1 CL1 1/17/2018 2/25/2018 2/5/2018 2/5/2018 5 32 25 20 4 1 0 Tester 4
C1 Z1 CK1 1/10/2018 2/20/2018 2/3/2018 2/3/2018 0 1 1 0 1 0 0 Tester5
C1 Z1 CK1 1/10/2018 2/20/2018 2/4/2018 2/4/2018 0 1 1 0 1 0 0 Tester5
C1 Z1 CK1 1/10/2018 2/20/2018 2/5/2018 2/5/2018 0 1 1 0 1 0 0 Tester5
C1 Z1 CK1 1/10/2018 2/20/2018 2/6/2018 2/6/2018 0 1 1 1 0 0 0 Tester5
C1 Z1 CK1 1/10/2018 2/20/2018 2/7/2018 2/7/2018 0 1 1 1 0 0 0 Tester6
C1 Z1 CK1 1/10/2018 2/20/2018 2/8/2018 2/8/2018 0 1 1 1 0 0 0 Tester6
C1 Z1 CK2 1/17/2018 2/18/2018 2/6/2018 2/6/2018 0 1 1 1 0 0 0 Tester6
C1 Z1 CK2 1/17/2018 2/18/2018 2/7/2018 2/7/2018 0 2 2 0 2 0 0 Tester6
C1 Z1 CK2 1/17/2018 2/18/2018 2/8/2018 2/8/2018 0 2 2 0 2 0 0 Tester7
C1 Z1 CK2 1/17/2018 2/18/2018 2/9/2018 2/9/2018 0 2 2 0 2 0 0 Tester7
C1 Z1 CK2 1/17/2018 2/18/2018 2/10/2018 2/10/2018 0 2 2 1 1 0 0 Tester7
C1 Z1 CK2 1/17/2018 2/18/2018 2/11/2018 2/11/2018 0 2 2 2 0 0 0 Tester7
I am using pandas groupby as follows:
import pandas as pd

dx1 = pd.read_excel('Trend.xlsx', sheet_name='Execution by Date')
dx1 = dx1.groupby(['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date',
                   'Cycle End Date'])[['Exec Date', 'Planned Exec Date', 'Available Test Cases',
                                       'Planned', 'Tested', 'Passed', 'Failed',
                                       'Blocked']].sum().reset_index()
Here is the result I get:
Project Release Name Cycle Name Cycle Start Date Cycle End Date Available Test Cases Planned Tested Passed Failed Blocked
B1 Y1 CL1 2018-01-17 00:00:00 2018-02-25 00:00:00 5 35 28 21 6 1
B1 Y1 CL1 2018-02-07 00:00:00 2018-02-25 00:00:00 15 57 42 38 4 0
B1 Y1 CM1 2018-02-07 00:00:00 2018-02-20 00:00:00 4 28 23 22 1 1
C1 Z1 CK1 2018-01-10 00:00:00 2018-02-20 00:00:00 0 6 6 3 3 0
C1 Z1 CK2 2018-01-17 00:00:00 2018-02-18 00:00:00 0 11 11 4 7 0
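As you can see, "Exec Date" and "Planned Exec Date" are missing.
How can I bring both missing date columns back into the DataFrame?
I have tried every solution that seemed relevant, but none of them worked for me.

You can't. You are grouping by ['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date', 'Cycle End Date'], and each combination of those keys has several different values for Exec Date and Planned Exec Date. In other words, a group may hold, for example, 3 different dates, and you can only keep one of them. It seems that groupby() does not pick one for you and simply drops those columns from the result. You can, however, pick the values yourself and then merge them into your groupby() result: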
import sys

import pandas as pd

pd.set_option("display.width", 300)

# StringIO lives in a different module on Python 2 and Python 3
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Read in data set
test_data = StringIO("""Project;Release Name;Cycle Name;Cycle Start Date;Cycle End Date;Exec Date;Planned Exec Date;Available Test Cases;Planned;Tested;Passed;Failed;Blocked;No Run;Tester
B1;Y1;CM1;2/7/2018;2/20/2018;2/6/2018;2/6/2018;2;10;8;8;0;0;0;Tester3
B1;Y1;CM1;2/7/2018;2/20/2018;2/7/2018;2/7/2018;2;13;10;9;1;1;0;Tester3
B1;Y1;CM1;2/7/2018;2/20/2018;2/8/2018;2/8/2018;0;1;1;1;0;0;0;Tester3
B1;Y1;CM1;2/7/2018;2/20/2018;2/9/2018;2/9/2018;0;2;2;2;0;0;0;Tester3
B1;Y1;CM1;2/7/2018;2/20/2018;2/10/2018;2/10/2018;0;2;2;2;0;0;0;Tester3
B1;Y1;CL1;2/7/2018;2/25/2018;2/1/2018;2/1/2018;5;25;20;20;0;0;0;Tester4
B1;Y1;CL1;2/7/2018;2/25/2018;2/2/2018;2/2/2018;10;30;20;18;2;0;0;Tester4
B1;Y1;CL1;2/7/2018;2/25/2018;2/3/2018;2/3/2018;0;2;2;0;2;0;0;Tester4
B1;Y1;CL1;1/17/2018;2/25/2018;2/4/2018;2/4/2018;0;3;3;1;2;0;0;Tester4
B1;Y1;CL1;1/17/2018;2/25/2018;2/5/2018;2/5/2018;5;32;25;20;4;1;0;Tester4
C1;Z1;CK1;1/10/2018;2/20/2018;2/3/2018;2/3/2018;0;1;1;0;1;0;0;Tester5
C1;Z1;CK1;1/10/2018;2/20/2018;2/4/2018;2/4/2018;0;1;1;0;1;0;0;Tester5
C1;Z1;CK1;1/10/2018;2/20/2018;2/5/2018;2/5/2018;0;1;1;0;1;0;0;Tester5
C1;Z1;CK1;1/10/2018;2/20/2018;2/6/2018;2/6/2018;0;1;1;1;0;0;0;Tester5
C1;Z1;CK1;1/10/2018;2/20/2018;2/7/2018;2/7/2018;0;1;1;1;0;0;0;Tester6
C1;Z1;CK1;1/10/2018;2/20/2018;2/8/2018;2/8/2018;0;1;1;1;0;0;0;Tester6
C1;Z1;CK2;1/17/2018;2/18/2018;2/6/2018;2/6/2018;0;1;1;1;0;0;0;Tester6
C1;Z1;CK2;1/17/2018;2/18/2018;2/7/2018;2/7/2018;0;2;2;0;2;0;0;Tester6
C1;Z1;CK2;1/17/2018;2/18/2018;2/8/2018;2/8/2018;0;2;2;0;2;0;0;Tester7
C1;Z1;CK2;1/17/2018;2/18/2018;2/9/2018;2/9/2018;0;2;2;0;2;0;0;Tester7
C1;Z1;CK2;1/17/2018;2/18/2018;2/10/2018;2/10/2018;0;2;2;1;1;0;0;Tester7
C1;Z1;CK2;1/17/2018;2/18/2018;2/11/2018;2/11/2018;0;2;2;2;0;0;0;Tester7""")
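df = pd.read_csv(test_data, sep=";")

# Sum the numeric columns per group. numeric_only=True leaves the two date
# columns out explicitly: older pandas versions dropped them silently, while
# newer versions would otherwise try to "sum" the date strings as well.
new_df = df.groupby(['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date', 'Cycle End Date'])[
    ['Exec Date', 'Planned Exec Date', 'Available Test Cases', 'Planned', 'Tested', 'Passed', 'Failed', 'Blocked']
].sum(numeric_only=True).reset_index()
print(new_df)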
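You can then run groupby() again, but keep only the first row of each group. The missing columns now show up, because there is no longer any ambiguity in them: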
# Get first occurrence of "Exec Date" and "Planned Exec Date"
firsts = df.groupby(['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date', "Cycle End Date"]).first().reset_index()
print(firsts)
firsts looks like:
Project Release Name Cycle Name Cycle Start Date Cycle End Date Exec Date Planned Exec Date Available Test Cases Planned Tested Passed Failed Blocked No Run Tester
0 B1 Y1 CL1 1/17/2018 2/25/2018 2/4/2018 2/4/2018 0 3 3 1 2 0 0 Tester4
1 B1 Y1 CL1 2/7/2018 2/25/2018 2/1/2018 2/1/2018 5 25 20 20 0 0 0 Tester4
2 B1 Y1 CM1 2/7/2018 2/20/2018 2/6/2018 2/6/2018 2 10 8 8 0 0 0 Tester3
3 C1 Z1 CK1 1/10/2018 2/20/2018 2/3/2018 2/3/2018 0 1 1 0 1 0 0 Tester5
4 C1 Z1 CK2 1/17/2018 2/18/2018 2/6/2018 2/6/2018 0 1 1 1 0 0 0 Tester6
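To make the ambiguity visible, one can count the distinct dates inside each group before choosing one. The following is only a minimal sketch on a cut-down copy of a few rows from the data above; the names df_small, keys, distinct_dates and first_dates are made up for this illustration, and only two of the five grouping columns are used to keep it short.

```python
import pandas as pd

# Cut-down illustrative frame: a few rows and columns taken from the data above
df_small = pd.DataFrame({
    "Project": ["B1", "B1", "B1", "C1", "C1"],
    "Cycle Name": ["CM1", "CM1", "CM1", "CK1", "CK1"],
    "Exec Date": ["2/6/2018", "2/7/2018", "2/8/2018", "2/3/2018", "2/4/2018"],
    "Tested": [8, 10, 1, 1, 1],
})

keys = ["Project", "Cycle Name"]

# Number of distinct "Exec Date" values per group; anything above 1 means
# a single value per group has to be chosen (first, last, min, max, ...)
distinct_dates = df_small.groupby(keys)["Exec Date"].nunique().reset_index()
print(distinct_dates)

# first() keeps the value from the first row of each group, so the result
# depends on the row order of the input
first_dates = df_small.groupby(keys)["Exec Date"].first().reset_index()
print(first_dates)
```

Because first() simply takes whatever row comes first, sorting the frame beforehand (or choosing a different aggregation) changes which date ends up in the result.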
Then merge the initial groupby() result (the one with the sums) with the groupby() result that contains the missing columns:
# Merge in the missing columns into the result from the groupby
new_df_with_missing_columns = new_df.merge(firsts[["Project", "Release Name", "Cycle Name", "Cycle Start Date", "Cycle End Date", "Exec Date", "Planned Exec Date"]], on=["Project", "Release Name", "Cycle Name", "Cycle Start Date", "Cycle End Date"])
print(new_df_with_missing_columns)
new_df_with_missing_columns looks like:
Project Release Name Cycle Name Cycle Start Date Cycle End Date Available Test Cases Planned Tested Passed Failed Blocked Exec Date Planned Exec Date
0 B1 Y1 CL1 1/17/2018 2/25/2018 5 35 28 21 6 1 2/4/2018 2/4/2018
1 B1 Y1 CL1 2/7/2018 2/25/2018 15 57 42 38 4 0 2/1/2018 2/1/2018
2 B1 Y1 CM1 2/7/2018 2/20/2018 4 28 23 22 1 1 2/6/2018 2/6/2018
3 C1 Z1 CK1 1/10/2018 2/20/2018 0 6 6 3 3 0 2/3/2018 2/3/2018
4 C1 Z1 CK2 1/17/2018 2/18/2018 0 11 11 4 7 0 2/6/2018 2/6/2018
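As a possible alternative to two groupby() calls plus a merge, the sums and the dates can be collected in a single agg() call by passing a per-column mapping. This is a sketch under stated assumptions, not the method used in the answer above: it works on a cut-down copy of a few rows (df_small, keys and agg_spec are names invented here), and it swaps first() for min() on parsed dates so that the chosen date does not depend on row order.

```python
import pandas as pd

# Cut-down illustrative frame: a few rows and columns taken from the data above
df_small = pd.DataFrame({
    "Project": ["B1", "B1", "B1", "C1", "C1"],
    "Cycle Name": ["CM1", "CM1", "CM1", "CK1", "CK1"],
    "Exec Date": ["2/6/2018", "2/7/2018", "2/8/2018", "2/3/2018", "2/4/2018"],
    "Planned": [10, 13, 1, 1, 1],
    "Tested": [8, 10, 1, 1, 1],
})

# Parse the date strings so that min() compares dates rather than text
df_small["Exec Date"] = pd.to_datetime(df_small["Exec Date"], format="%m/%d/%Y")

keys = ["Project", "Cycle Name"]

# One aggregation per column: earliest date, summed counters
agg_spec = {"Exec Date": "min", "Planned": "sum", "Tested": "sum"}

combined = df_small.groupby(keys).agg(agg_spec).reset_index()
print(combined)
```

Replacing "min" with "max" or "first" in agg_spec selects a different date per group; which one is appropriate depends on what the report is meant to show.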
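Welcome to SO. Please provide a report. Also, please read through:
Thanks for the reply. I was wondering whether we could do something similar to what is done in this post: the difference is that that post deals with the year part of the date and has fewer columns.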