Python feature engineering on movement data

Hi, I am trying to expand a patient dataset from a movement level to a patient level. The original df looks like this:
Conditions:
1) Create "Last Test Change" cols - for a CaseNo where the Category value 'ICU' is encountered, take the Test change just before the 'ICU' value (for Test1, CaseNo 1: 189 - 180); otherwise take the latest Test change (for Test1, CaseNo 2: 256 - 266).
2) Create "Test Pattern" cols - for a CaseNo where the Category value 'ICU' is encountered, pivot all Test values from the start up to just before the 'ICU' value; otherwise pivot all Test values from start to end.
3) Create "Last Test Count" cols - for a CaseNo where the Category value 'ICU' is encountered, take the last Test value before 'ICU' is encountered; otherwise take the last Test value.
Expected result:
How can I achieve this in Python?
df code:
import pandas as pd

df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
'2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
'2020-03-03 17:50:00','2020-03-17 08:24:19'],
'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
'2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
'2222-12-31 23:00:00','2020-03-18 18:50:00'],
'Category':['A','A','ICU','A','B','B','B','B'],
'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
'2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
'Test1':['180','189','190','188','328','NA','266','256'],
'Test2':['20','21','15','10','33','30','28','15'],
'Test3':['55','NA','65','70','58','64','68','58'],
'Age':['65','65','65','65','45','45','45','45']})
Expected result:
df2 = pd.DataFrame({'CaseNo':[1, 2],
'Last Test1 Change':[9, -10],
'Test1 Pattern':['180, 189', '328, 266, 256'],
'Last Test1 Count':[189, 256],
'Last Test2 Change':[1, -13],
'Test2 Pattern':['20, 21', '33, 30, 28, 15'],
'Last Test2 Count':[21, 15],
'Last Test3 Change':[10, -10],
'Test3 Pattern':['55', '58, 64, 68, 58'],
'Last Test3 Count':[55, 58],
'Age':[65, 45]})
I will just show you how to solve your problem in a general way. For the first condition, you can create a helper index via cumsum to filter out the data after 'ICU':
df["helper"] = df.groupby("CaseNo")["Category"].transform(lambda d: d.eq("ICU").cumsum())
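To see what this helper does, here is a tiny self-contained example (hypothetical data, not the OP's frame): within each group, rows strictly before the first 'ICU' get 0, and rows at or after it get a positive count.

```python
import pandas as pd

# Hypothetical mini-frame to illustrate the cumsum helper:
demo = pd.DataFrame({"CaseNo": [1, 1, 1, 2, 2],
                     "Category": ["A", "ICU", "A", "B", "B"]})
demo["helper"] = demo.groupby("CaseNo")["Category"].transform(
    lambda d: d.eq("ICU").cumsum())
# rows strictly before the first 'ICU' keep helper == 0
print(demo["helper"].tolist())  # [0, 1, 1, 0, 0]
```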
I am not sure what n stands for, but if you just want to take a certain number of rows, use groupby and tail:
s = df.loc[df["helper"].eq(0)].groupby("CaseNo").tail(4).filter(regex="CaseNo|Test.*|Age")
print (s)
CaseNo Test1 Test2 Test3 Age
0 1 180.0 20.0 55.0 65
1 1 189.0 21.0 NaN 65
4 2 328.0 33.0 58.0 45
5 2 NaN 30.0 64.0 45
6 2 266.0 28.0 68.0 45
7 2 256.0 15.0 58.0 45
Finally, pivot your data:
res = (pd.pivot_table(s, index=["CaseNo", "Age"],
aggfunc=["last", list]).reset_index())
print (res)
CaseNo Age last list
Test1 Test2 Test3 Test1 Test2 Test3
0 1 65 189.0 21.0 55.0 [180.0, 189.0] [20.0, 21.0] [55.0, nan]
1 2 45 256.0 15.0 58.0 [328.0, nan, 266.0, 256.0] [33.0, 30.0, 28.0, 15.0] [58.0, 64.0, 68.0, 58.0]
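One possible way to turn each pattern list into the "Last Change" column (a sketch; last_change is a hypothetical helper, and the NaN handling mirrors the expected output) is to take the difference of the last two non-NaN values:

```python
import math

# Hypothetical helper: difference of the last two non-NaN values
# in a pattern list, or None when fewer than two values remain.
def last_change(values):
    vals = [v for v in values
            if not (isinstance(v, float) and math.isnan(v))]
    return vals[-1] - vals[-2] if len(vals) >= 2 else None

print(last_change([180.0, 189.0]))                       # 9.0
print(last_change([328.0, float("nan"), 266.0, 256.0]))  # -10.0
```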
From here, you can work towards your final goal.

Try the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
'2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
'2020-03-03 17:50:00','2020-03-17 08:24:19'],
'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
'2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
'2222-12-31 23:00:00','2020-03-18 18:50:00'],
'Category':['A','A','ICU','A','B','B','B','B'],
'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
'2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
'Test1':['180','189','190','188','328','NA','266','256'],
'Test2':['20','21','15','10','33','30','28','15'],
'Test3':['55','NA','65','70','58','64','68','58'],
'Age':['65','65','65','65','45','45','45','45']})
# simple data management
df = df.replace('NA', np.nan)
df[['Test1','Test2','Test3','Age']] = df[['Test1','Test2','Test3','Age']].astype(float)
# list to collect the per-case results
partial_results = []
# split original df into groups based on CaseNo
for jj, (j, gr) in enumerate(df.groupby('CaseNo')):
    group = gr.copy()
    # identify the presence of ICU
    group['Category'] = (group['Category'].values == 'ICU').cumsum()
    # replace NaN values with the next valid value
    # this is useful to fill NaN in Test1, Test2, Test3
    group_fill = group.bfill()
    # select the part of the df before the first ICU match
    group_fill = group_fill[group_fill.Category == 0]
    group = group[group.Category == 0]
    # at this point we have two copies of our group df (group and group_fill)
    # group contains the raw values (including NaN) for the selected CaseNo
    # group_fill contains the filled values for the selected CaseNo
    # create empty df to store partial results
    partial = pd.DataFrame()
    # select unique CaseNo
    partial['CaseNo'] = group['CaseNo'].unique()
    # loop to run the operations on Test1, Test2 and Test3
    for i in range(1, 4):
        # these are simply the operations you required
        # NB: 'Last TestN Change' is computed on the filled group df,
        # which is important to avoid errors when the last observation is NaN;
        # 'TestN Pattern' and 'Last TestN Count' use the raw group df with dropna
        partial[f'Last Test{i} Change'] = group_fill[f'Test{i}'].tail(2).diff().tail(1).values
        partial[f'Test{i} Pattern'] = [group[f'Test{i}'].dropna().to_list()]
        partial[f'Last Test{i} Count'] = group[f'Test{i}'].dropna().tail(1).values
    # select unique age
    partial['Age'] = group['Age'].unique()
    # create the correct index for the final results
    partial.index = range(jj, jj + 1)
    partial_results.append(partial)
# concatenate the partial results into the final results df
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
results = pd.concat(partial_results)
# print final results df
results
The n in Test stands for the test number of the Test1, Test2 and Test3 columns respectively. - Cool. Could you please add some explanation/comments to your code? Thanks! - I added comments in the code... I remain at your disposal. - Nice code Marco, especially the .tail(2).diff().tail(1).values part. Credit is probably due to @Henry Yik for group['Category'] = (group['Category'].values == 'ICU').cumsum()
One last question: why did you choose to use for jj,(j,gr) in enumerate(df.groupby('CaseNo')):? My concern is computational efficiency, because I will be applying this to a real dataset with hundreds of thousands of rows... Why did you not choose to convert df to a list and iterate over that, or to use for jj,(j,gr) in df.groupby('CaseNo').iterrows()?
tail(2) takes the last 2 elements, diff() on them yields [NaN, float], and we need the float / last element, so tail(1) does the job. Using cumsum with complex chaining to satisfy certain conditions is a nice trick that can be used in many situations. df.groupby('CaseNo').iterrows() is not possible... iterrows cannot be called on a groupby object. I understand what you are saying, and I suggest keeping it as it is, because that is what works.
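On the efficiency question: a loop-free variant of the same idea is possible with groupby.apply, where a single function computes all the per-case features at once (a sketch on hypothetical data, shown for Test1 only; summarize is a made-up name, and this is not the accepted answer's exact method):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one ICU case and one non-ICU case.
df = pd.DataFrame({"CaseNo": [1, 1, 1, 2, 2],
                   "Category": ["A", "ICU", "A", "B", "B"],
                   "Test1": [180.0, 190.0, 188.0, 328.0, 256.0]})

def summarize(g):
    # keep only the rows before the first 'ICU' (same cumsum trick as above)
    g = g[(g["Category"] == "ICU").cumsum() == 0]
    vals = g["Test1"].dropna()
    return pd.Series({
        "Test1 Pattern": vals.tolist(),
        "Last Test1 Count": vals.iloc[-1] if len(vals) else np.nan,
        "Last Test1 Change": vals.diff().iloc[-1] if len(vals) > 1 else np.nan,
    })

out = df.groupby("CaseNo").apply(summarize)
print(out)
```

Whether this is actually faster than the explicit loop depends on the number of groups, so it is worth timing both on the real dataset before switching.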