Python: trying to blend/merge column values from multiple rows into a single row
I'm trying to aggregate some data with pandas, creating new columns that hold values from the original dataset so that the total number of rows is reduced. For example:
import pandas as pd

d = pd.DataFrame([['0001', None, 'backlog', '2020-01-15', '2020-01-31'],
['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
['0002', 'backlog', 'complete', '2019-02-25', '9999-12-31'],
['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
['0003', 'backlog', 'modified', '2020-01-31', '2020-02-05'],
['0003', 'modified', 'qe_backlog', '2020-02-05', '2020-02-20'],
['0003', 'qe_backlog', 'verified', '2020-02-20', '9999-12-31']] ,
columns=['id', 'old_state', 'new_state', 'start_dttm', 'end_dttm'])
which results in:
id old_state new_state start_dttm end_dttm
0 0001 None backlog 2020-01-15 2020-01-31
1 0001 backlog complete 2020-01-31 9999-12-31
2 0001 backlog complete 2020-01-31 9999-12-31
3 0002 None backlog 2019-02-15 2019-02-25
4 0002 None backlog 2019-02-15 2019-02-25
5 0002 None backlog 2019-02-15 2019-02-25
6 0002 None backlog 2019-02-15 2019-02-25
7 0002 backlog complete 2019-02-25 9999-12-31
8 0003 None backlog 2020-01-15 2020-01-31
9 0003 None backlog 2020-01-15 2020-01-31
10 0003 None backlog 2020-01-15 2020-01-31
11 0003 backlog modified 2020-01-31 2020-02-05
12 0003 modified qe_backlog 2020-02-05 2020-02-20
13 0003 qe_backlog verified 2020-02-20 9999-12-31
What I'd like to end up with is:
id state backlog_dttm completed_dttm modified_dttm qe_backlog_dttm verified_dttm
0001 complete 2020-01-15 2020-01-31 null null null
0002 complete 2019-02-15 2019-02-25 null null null
0003 verified 2020-01-15 null 2020-01-31 2020-02-05 2020-02-20
So far I have:
d.drop_duplicates(subset=d.columns, keep='last', inplace=True)
d.set_index('id', inplace=True)
Then, at this point, things stall when I try to set backlog_dttm:
d2['backlog_dttm'] = d[d['old_state'].isnull() & (d['new_state'] == 'backlog')]['start_dttm']
d2 = d.loc[d['end_dttm'] == d.end_dttm.max()]
d2.loc[d2.index,'backlog_dttm'] = d[d['old_state'].isnull() & (d['new_state'] == 'backlog')]['start_dttm']
d2.loc[d2.index, 'completed_dttm'] = d[d['new_state'] == 'complete']['start_dttm']
d2.loc[d2.index, 'modified_dttm'] = d[d['new_state'] == 'modified']['start_dttm']
d2.loc[d2.index, 'qe_backlog_dttm'] = d[d['new_state'] == 'qe_backlog']['start_dttm']
The above raises a SettingWithCopyWarning but appears to work. The final desired output should look something like this:
old_state new_state start_dttm end_dttm backlog_dttm \
id
0001 backlog complete 2020-01-31 9999-12-31 2020-01-15
0002 backlog complete 2019-02-25 9999-12-31 2019-02-15
0003 qe_backlog verified 2020-02-20 9999-12-31 2020-01-15
completed_dttm modified_dttm qe_backlog_dttm
id
0001 2020-01-31 NaN NaN
0002 2019-02-25 NaN NaN
0003 NaN 2020-01-31 2020-02-05
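FYI: this is only a sample. The real dataset is based on a development workflow that has additional states such as ready for test, verified, in progress, and so on. I need to populate columns for those states as well, i.e. verified_dttm, read_to_test_dttm.
The start_dttm and end_dttm fields identify the date a record entered a given state and the date it left that state.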
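Any ideas or suggestions would be greatly appreciated. Thanks, everyone!

If old_state is in [None, 'backlog'] and new_state is in ['backlog', 'complete'], you can use: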
df[df['old_state'].isna()].assign(old_state='complete').drop('new_state', axis=1).rename(columns={'old_state': 'state', 'start_dttm': 'backlog_dttm', 'end_dttm': 'completed_dttm'})
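To see what that one-liner produces, here is a minimal sketch run on the two ids from the question that only go from backlog to complete. The variable name `out` and the added `drop_duplicates()` / `reset_index()` calls are my own additions for illustration. It relies on the assumption that the end_dttm of the backlog row equals the start_dttm of the complete row, so it would not be correct for ids that pass through other states.

```python
import pandas as pd

# Sample rows for ids that only move None -> backlog -> complete
df = pd.DataFrame(
    [['0001', None, 'backlog', '2020-01-15', '2020-01-31'],
     ['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
     ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
     ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
     ['0002', 'backlog', 'complete', '2019-02-25', '9999-12-31']],
    columns=['id', 'old_state', 'new_state', 'start_dttm', 'end_dttm'])

out = (df.drop_duplicates()                       # collapse repeated rows first
         .loc[lambda x: x['old_state'].isna()]    # keep only the initial backlog row
         .assign(old_state='complete')            # reuse old_state as the final state
         .drop('new_state', axis=1)
         .rename(columns={'old_state': 'state',
                          'start_dttm': 'backlog_dttm',
                          'end_dttm': 'completed_dttm'})
         .reset_index(drop=True))
print(out)
```

Each id ends up on a single row with state, backlog_dttm, and completed_dttm columns.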
If you just want to get rid of the SettingWithCopyWarning, you can specify the index when adding the column:
d2.loc[d2.index,'backlog_dttm'] = d[d['old_state'].isnull() & (d['new_state'] == 'backlog')]['start_dttm']
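Another common way to avoid the warning, sketched here as an alternative rather than as what the line above does, is to take an explicit `.copy()` of the filtered frame before adding columns. The example uses a shortened version of the question's data, and the name `is_backlog` is only illustrative. Plain column assignment aligns the right-hand Series on the `id` index, so ids without a matching row get NaN.

```python
import pandas as pd

d = pd.DataFrame(
    [['0001', None, 'backlog', '2020-01-15', '2020-01-31'],
     ['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
     ['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
     ['0003', 'backlog', 'modified', '2020-01-31', '2020-02-05'],
     ['0003', 'modified', 'qe_backlog', '2020-02-05', '2020-02-20'],
     ['0003', 'qe_backlog', 'verified', '2020-02-20', '9999-12-31']],
    columns=['id', 'old_state', 'new_state', 'start_dttm', 'end_dttm'])
d = d.drop_duplicates().set_index('id')

# .copy() makes d2 an independent frame, so adding columns to it is unambiguous
d2 = d.loc[d['end_dttm'] == d['end_dttm'].max()].copy()

# Each right-hand Series has a unique id index, so assignment aligns per id
is_backlog = d['old_state'].isnull() & (d['new_state'] == 'backlog')
d2['backlog_dttm'] = d.loc[is_backlog, 'start_dttm']
d2['modified_dttm'] = d.loc[d['new_state'] == 'modified', 'start_dttm']
print(d2)
```

This only works as written when each id has at most one row per state after de-duplication; otherwise the alignment step would fail on the duplicated index.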
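Kenan's hack of a solution seems to address more than just renaming columns.
I'll give this a try; however, my concern is that this is only a sample dataset. The real dataset has 20+ columns, so this approach could be error-prone. There are also many more states than the ones I mentioned. The data comes from a development engineering workflow, so there are other states such as ready for test and verified, for which I also have columns named ready_for_test_dttm and verified_dttm. Yes, I thought I had a good example until I realized I didn't. Done; I believe I now have a more accurate representation of the data.

You can try this: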
# find the most recent state for each id
df1 = df.groupby('id').agg({'new_state':'last'})
# find start dates for each new state by id and unstack into columns
df2 = df.groupby(['id','new_state'])['start_dttm'].agg('first').unstack()
# merge grouped dataframes together by id
df = df1.join(df2).reset_index()
print(df)
id new_state backlog complete modified qe_backlog verified
0 0001 complete 2020-01-15 2020-01-31 NaN NaN NaN
1 0002 complete 2019-02-15 2019-02-25 NaN NaN NaN
2 0003 verified 2020-01-15 NaN 2020-01-31 2020-02-05 2020-02-20
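The column names above are the raw state names, whereas the question asks for a `_dttm` suffix and a `state` column. A minimal end-to-end sketch on the question's sample data might look like the following. The names `latest`, `wide`, and `result` are illustrative, and sorting by start_dttm assumes the dates are ISO-formatted strings, which sort chronologically. Because `unstack` creates one column per state it finds, additional workflow states should get their own columns without being listed by hand.

```python
import pandas as pd

d = pd.DataFrame(
    [['0001', None, 'backlog', '2020-01-15', '2020-01-31'],
     ['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
     ['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
     ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
     ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
     ['0002', 'backlog', 'complete', '2019-02-25', '9999-12-31'],
     ['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
     ['0003', 'backlog', 'modified', '2020-01-31', '2020-02-05'],
     ['0003', 'modified', 'qe_backlog', '2020-02-05', '2020-02-20'],
     ['0003', 'qe_backlog', 'verified', '2020-02-20', '9999-12-31']],
    columns=['id', 'old_state', 'new_state', 'start_dttm', 'end_dttm'])

d = d.drop_duplicates()

# Most recent state per id, taken from the row with the latest start_dttm
latest = (d.sort_values('start_dttm')
            .groupby('id')['new_state'].last()
            .rename('state'))

# First start date of every state per id, spread into one column per state
wide = (d.groupby(['id', 'new_state'])['start_dttm'].first()
          .unstack()
          .add_suffix('_dttm')
          .rename(columns={'complete_dttm': 'completed_dttm'}))

result = latest.to_frame().join(wide).reset_index()
print(result)
```

The `rename` step is only there because the question names the column completed_dttm while the state itself is called complete.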