Database: Restructuring a date table
I need to transform a table and I don't know where to start. This is the table:
| Customer Code | Activity | Start Date |
|:---------------:|:--------:|:----------:|
| 100 | A | 01/05/2017 |
| 100 | A | 19/07/2017 |
| 100 | B | 18/09/2017 |
| 100 | C | 07/12/2017 |
| 101 | A | 11/02/2018 |
| 101 | B | 02/04/2018 |
| 101 | B | 14/06/2018 |
| 100 | A | 13/07/2018 |
| 100 | B | 14/08/2018 |
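A customer always performs activities A, B, and C in that order. To perform activity B, the customer must first have performed activity A; to perform activity C, the customer must first have performed A and then B. The same customer can perform an activity, or the whole cycle, more than once.
I need to reorganize the table so that each step has its start and its end: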
| Customer Code | Activity | Start Date | End Date |
|:---------------:|:--------:|:----------:|:----------:|
| 100 | A | 01/05/2017 | 18/09/2017 |
| 100 | B | 18/09/2017 | 07/12/2017 |
| 100 | C | 07/12/2017 | 13/07/2018 |
| 101 | A | 11/02/2018 | 02/04/2018 |
| 101 | B | 02/04/2018 | |
| 100 | A | 13/07/2018 | 14/08/2018 |
| 100 | B | 14/08/2018 | |
Thanks! :-)
IIUC, you can use:
import pandas as pd
# The dates are day-first (DD/MM/YYYY), so say so explicitly
df['Start Date'] = pd.to_datetime(df['Start Date'], dayfirst=True)
# New group id every time the customer code changes from the previous row
grp = (df['Customer Code'] != df['Customer Code'].shift()).cumsum().rename('grp')
# Earliest start date of each activity within each group
df_out = df.groupby([grp,'Customer Code', 'Activity'])['Start Date'].min().reset_index()
# The next start date for the same customer becomes the end date
df_out['End Date'] = df_out.groupby('Customer Code')['Start Date'].shift(-1)
df_out
Output:
grp Customer Code Activity Start Date End Date
0 1 100 A 2017-05-01 2017-09-18
1 1 100 B 2017-09-18 2017-12-07
2 1 100 C 2017-12-07 2018-07-13
3 2 101 A 2018-02-11 2018-04-02
4 2 101 B 2018-04-02 NaT
5 3 100 A 2018-07-13 2018-08-14
6 3 100 B 2018-08-14 NaT
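A note on the date parsing: the dates in the question are written day-first (DD/MM/YYYY). Without a hint, pandas reads a string such as 01/05/2017 month-first, as January 5 rather than May 1. A minimal sketch of one way to remove the ambiguity, assuming every date in the column follows the same DD/MM/YYYY layout (the variable names `raw` and `parsed` are only illustrative):

```python
import pandas as pd

# Two of the date strings from the question, written day-first (DD/MM/YYYY)
raw = pd.Series(["01/05/2017", "07/12/2017"])

# An explicit format leaves no room for a day/month mix-up
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2017-05-01', '2017-12-07']
```

Passing `dayfirst=True` instead of `format` should give the same result for this data.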
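Details:
First, create grp from the changes in Customer Code, so that consecutive rows with the same customer code fall into the same group, then take the minimum start date of each activity within each grp. Next, group by 'Customer Code' and shift the next activity's start date up into 'End Date'.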
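To make these steps easier to follow, here is a self-contained sketch that rebuilds the sample table from the question and prints the intermediate grp labels. The frame construction and the explicit `format="%d/%m/%Y"` are my own assumptions about how the data is loaded, not something stated in the question:

```python
import pandas as pd

# Sample data copied from the question (dates are DD/MM/YYYY strings)
df = pd.DataFrame({
    "Customer Code": [100, 100, 100, 100, 101, 101, 101, 100, 100],
    "Activity": ["A", "A", "B", "C", "A", "B", "B", "A", "B"],
    "Start Date": ["01/05/2017", "19/07/2017", "18/09/2017", "07/12/2017",
                   "11/02/2018", "02/04/2018", "14/06/2018", "13/07/2018",
                   "14/08/2018"],
})
df["Start Date"] = pd.to_datetime(df["Start Date"], format="%d/%m/%Y")

# Step 1: a new label starts whenever the customer code differs from the row above
grp = (df["Customer Code"] != df["Customer Code"].shift()).cumsum().rename("grp")
print(grp.tolist())  # [1, 1, 1, 1, 2, 2, 2, 3, 3]

# Step 2: earliest start date per (grp, customer, activity)
df_out = (df.groupby([grp, "Customer Code", "Activity"])["Start Date"]
            .min()
            .reset_index())

# Step 3: the next start date of the same customer becomes the end date
df_out["End Date"] = df_out.groupby("Customer Code")["Start Date"].shift(-1)
print(df_out)
```

Grouping the final shift by 'Customer Code' rather than by grp is what lets activity C of customer 100 end on 13/07/2018, when that customer's next cycle begins.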
A similar approach using drop_duplicates:
# Parse day-first dates (harmless if the column is already datetime)
df['Start Date'] = pd.to_datetime(df['Start Date'], dayfirst=True)
df['grp'] = (df['Customer Code'] != df['Customer Code'].shift()).cumsum()
# Keep the first row of each activity within each group (assumes rows are in date order)
df = df.drop_duplicates(['grp','Customer Code', 'Activity']).copy()
df['End Date'] = df.groupby('Customer Code')['Start Date'].shift(-1)
df
Output:
Customer Code Activity Start Date grp End Date
0 100 A 2017-05-01 1 2017-09-18
2 100 B 2017-09-18 1 2017-12-07
3 100 C 2017-12-07 1 2018-07-13
4 101 A 2018-02-11 2 2018-04-02
5 101 B 2018-04-02 2 NaT
7 100 A 2018-07-13 3 2018-08-14
8 100 B 2018-08-14 3 NaT
Can you explain where the second B for 101 went?
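As far as I can tell, the second B for 101 (14/06/2018) is collapsed into the first one, in the same way the second A for 100 (19/07/2017) is: the expected output in the question keeps only the first start date of each activity within a cycle. A small sketch to list the rows that get collapsed, rebuilding the sample frame and leaving the dates as strings for brevity (the name `repeats` is only illustrative):

```python
import pandas as pd

# Sample data copied from the question; dates left as strings on purpose
df = pd.DataFrame({
    "Customer Code": [100, 100, 100, 100, 101, 101, 101, 100, 100],
    "Activity": ["A", "A", "B", "C", "A", "B", "B", "A", "B"],
    "Start Date": ["01/05/2017", "19/07/2017", "18/09/2017", "07/12/2017",
                   "11/02/2018", "02/04/2018", "14/06/2018", "13/07/2018",
                   "14/08/2018"],
})
df["grp"] = (df["Customer Code"] != df["Customer Code"].shift()).cumsum()

# Rows that repeat an activity already seen earlier in the same group
repeats = df[df.duplicated(["grp", "Customer Code", "Activity"])]
print(repeats)
```

This should print the rows with index 1 and 6, which are exactly the two rows missing from the expected output.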