
Python: trying to blend/merge column values from multiple rows into one row

Tags: python, pandas, dataframe, merge, blending

I'm trying to use pandas to aggregate some data, creating a couple of new columns that hold values from the original dataset so as to reduce the total number of rows.

For example:

import pandas as pd

d = pd.DataFrame([['0001', None, 'backlog', '2020-01-15', '2020-01-31'],
                  ['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
                  ['0001', 'backlog', 'complete', '2020-01-31', '9999-12-31'],
                  ['0002', None, 'backlog', '2019-02-15', '2019-02-25'], 
                  ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
                  ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
                  ['0002', None, 'backlog', '2019-02-15', '2019-02-25'],
                  ['0002', 'backlog', 'complete', '2019-02-25', '9999-12-31'],
                  ['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
                  ['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
                  ['0003', None, 'backlog', '2020-01-15', '2020-01-31'],
                  ['0003', 'backlog', 'modified', '2020-01-31', '2020-02-05'],
                  ['0003', 'modified', 'qe_backlog', '2020-02-05', '2020-02-20'],
                  ['0003', 'qe_backlog', 'verified', '2020-02-20', '9999-12-31']] ,
                 columns=['id', 'old_state', 'new_state', 'start_dttm', 'end_dttm'])
which results in:

      id   old_state   new_state  start_dttm    end_dttm
0   0001        None     backlog  2020-01-15  2020-01-31
1   0001     backlog    complete  2020-01-31  9999-12-31
2   0001     backlog    complete  2020-01-31  9999-12-31
3   0002        None     backlog  2019-02-15  2019-02-25
4   0002        None     backlog  2019-02-15  2019-02-25
5   0002        None     backlog  2019-02-15  2019-02-25
6   0002        None     backlog  2019-02-15  2019-02-25
7   0002     backlog    complete  2019-02-25  9999-12-31
8   0003        None     backlog  2020-01-15  2020-01-31
9   0003        None     backlog  2020-01-15  2020-01-31
10  0003        None     backlog  2020-01-15  2020-01-31
11  0003     backlog    modified  2020-01-31  2020-02-05
12  0003    modified  qe_backlog  2020-02-05  2020-02-20
13  0003  qe_backlog    verified  2020-02-20  9999-12-31
and in the end I'd like to arrive at:

id    state     backlog_dttm  completed_dttm  modified_dttm  qe_backlog_dttm  verified_dttm
0001  complete  2020-01-15    2020-01-31      null           null             null
0002  complete  2019-02-15    2019-02-25      null           null             null
0003  verified  2020-01-15    null            2020-01-31     2020-02-05       2020-02-20
So far I have:

# collapse exact duplicate rows (14 rows down to 8), keeping the last occurrence
d.drop_duplicates(subset=d.columns, keep='last', inplace=True)
d.set_index('id', inplace=True)
and then, at the point of trying to set backlog_dttm, things stall:

# keep only each id's final state (the rows still open on 9999-12-31)
d2 = d.loc[d['end_dttm'] == d.end_dttm.max()]
# pull each state's entry date over from the matching rows in d
d2.loc[d2.index, 'backlog_dttm'] = d[d['old_state'].isnull() & (d['new_state'] == 'backlog')]['start_dttm']
d2.loc[d2.index, 'completed_dttm'] = d[d['new_state'] == 'complete']['start_dttm']
d2.loc[d2.index, 'modified_dttm'] = d[d['new_state'] == 'modified']['start_dttm']
d2.loc[d2.index, 'qe_backlog_dttm'] = d[d['new_state'] == 'qe_backlog']['start_dttm']
The above raises a SettingWithCopyWarning but seems to work. The final desired output should look something like this:

       old_state new_state  start_dttm    end_dttm backlog_dttm  \
id                                                                
0001     backlog  complete  2020-01-31  9999-12-31   2020-01-15   
0002     backlog  complete  2019-02-25  9999-12-31   2019-02-15   
0003  qe_backlog  verified  2020-02-20  9999-12-31   2020-01-15   

     completed_dttm modified_dttm qe_backlog_dttm  
id                                                 
0001     2020-01-31           NaN             NaN  
0002     2019-02-25           NaN             NaN  
0003            NaN    2020-01-31      2020-02-05
FYI: this is just a sample. The real dataset comes from a development workflow, so there are additional states such as ready_to_test, verified, in_progress, etc., and I likewise need to populate columns for those states, i.e. verified_dttm, ready_to_test_dttm.

The start_dttm and end_dttm fields identify the date a record entered a given state and the date it left that state.
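
For example, on the frame as first constructed above (before the drop_duplicates/set_index steps), a record's stint in a state can be read off directly:

# id '0003' entered 'qe_backlog' on 2020-02-05 and left it on 2020-02-20
d[(d['id'] == '0003') & (d['new_state'] == 'qe_backlog')][['start_dttm', 'end_dttm']]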


Any thoughts/suggestions would be greatly appreciated -- thanks, all!

If old_state in [None, backlog] and new_state in [backlog, complete], you can use

(df[df['old_state'].isna()]
   .assign(old_state='complete')
   .drop('new_state', axis=1)
   .rename(columns={'old_state': 'state',
                    'start_dttm': 'backlog_dttm',
                    'end_dttm': 'completed_dttm'}))
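
Applied to the question's frame after the drop_duplicates step (assuming df is that frame, with id still a regular column), this would print something like the output below. Note that id 0003 gets mislabeled as 'complete', which is why this only covers the backlog-to-complete case:

print(df[df['old_state'].isna()]
        .assign(old_state='complete')
        .drop('new_state', axis=1)
        .rename(columns={'old_state': 'state',
                         'start_dttm': 'backlog_dttm',
                         'end_dttm': 'completed_dttm'}))
#       id     state backlog_dttm completed_dttm
# 0   0001  complete   2020-01-15     2020-01-31
# 6   0002  complete   2019-02-15     2019-02-25
# 10  0003  complete   2020-01-15     2020-01-31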

If you just want to get rid of the SettingWithCopyWarning, you can specify the index when adding the column:

d2.loc[d2.index,'backlog_dttm'] = d[d['old_state'].isnull() & (d['new_state'] == 'backlog')]['start_dttm']
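
Alternatively (a sketch, not part of the original answer), taking an explicit copy of the slice up front makes the later column assignments operate on an independent frame rather than a possible view of d, which avoids the warning entirely:

# copy the slice so subsequent assignments don't touch a view of d
d2 = d.loc[d['end_dttm'] == d.end_dttm.max()].copy()
d2['backlog_dttm'] = d[d['old_state'].isnull() & (d['new_state'] == 'backlog')]['start_dttm']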
Kenan's hack of a solution seems to nicely handle more than just renaming the columns.

You can try using groupby and unstack:



I'll give this a try; my concern, however, is that this is just a sample dataset. The actual dataset has 20+ columns, so that approach may be error-prone. There are also many more states than the ones I mentioned. This is data from a development engineering workflow, so there are other states such as ready_to_test and verified, for which I also have columns named ready_for_test_dttm and verified_dttm. Yes, I thought I had a good sample until I realized I didn't. Done; I believe the data is now represented more accurately.
# find the most recent state for each id
df1 = df.groupby('id').agg({'new_state':'last'})
# find start dates for each new state by id and unstack into columns
df2 = df.groupby(['id','new_state'])['start_dttm'].agg('first').unstack()
# merge grouped dataframes together by id
df = df1.join(df2).reset_index() 
print(df)                                                                                                        
     id new_state     backlog    complete    modified  qe_backlog    verified
0  0001  complete  2020-01-15  2020-01-31         NaN         NaN         NaN
1  0002  complete  2019-02-15  2019-02-25         NaN         NaN         NaN
2  0003  verified  2020-01-15         NaN  2020-01-31  2020-02-05  2020-02-20
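
To line this result up with the column names in the desired output, a small follow-up rename could suffix the unstacked state columns with _dttm and rename new_state to state. This is a sketch assuming every column other than id and new_state is a state column; note the suffix approach yields complete_dttm rather than completed_dttm, which can be mapped explicitly if the exact name matters:

# suffix each state column with _dttm and rename new_state to state
df = df.rename(columns={c: c + '_dttm' for c in df.columns if c not in ('id', 'new_state')})
df = df.rename(columns={'new_state': 'state'})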