Python: merging two dataframes based on a time column

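Note: I previously asked a similar question about the same data, but now I'm trying to merge the dataframes in a different way.

I have two dataframes that store different types of patient medical information. The elements the two dataframes have in common are the encounter ID (hadm_id) and the time the information was recorded ((n|c)e_charttime).

One dataframe (ds) contains structured information; the other (dn) contains a column of clinical notes recorded at specific times. Both dataframes contain multiple encounters, but the common element is the encounter ID (hadm_id).

Here is a sample of the dataframes: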

ds
    hadm_id ce_charttime    hr  sbp dbp
0   140694  2121-08-12 19:00:00 67.0    102.0   75.0
1   140694  2121-08-12 19:45:00 68.0    135.0   68.0
2   140694  2121-08-12 20:00:00 70.0    153.0   94.0
3   171544  2153-09-06 14:11:00 80.0    114.0   50.0
4   171544  2153-09-06 17:30:00 80.0    114.0   50.0
5   171544  2153-09-06 17:35:00 80.0    114.0   50.0
6   171544  2153-09-06 17:40:00 76.0    115.0   51.0
7   171544  2153-09-06 17:45:00 79.0    117.0   53.0
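The real data contains nearly 10,000 encounters, more than 250,000 rows of structured data, and 50,000 rows of clinical notes.

I want to merge them based on the time the information was charted. For example, if you take a single encounter from both dataframes and sort the rows by charttime, I want the resulting dataframe to contain all the information, with NaN for the missing values. Given the two dataframes above as input, the result would look like this: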

final
    hadm_id charttime   ce_charttime    hr  sbp dbp ne_charttime    note
0   140694  2121-08-10 20:32:00 NaT NaN NaN NaN 2121-08-10 20:32:00 some text1
1   140694  2121-08-11 12:57:00 NaT NaN NaN NaN 2121-08-11 12:57:00 some text2
2   140694  2121-08-11 15:18:00 NaT NaN NaN NaN 2121-08-11 15:18:00 some text3
3   140694  2121-08-12 19:00:00 2121-08-12 19:00:00 67.0    102.0   75.0    NaT NaN
4   140694  2121-08-12 19:45:00 2121-08-12 19:45:00 68.0    135.0   68.0    NaT NaN
5   140694  2121-08-12 20:00:00 2121-08-12 20:00:00 70.0    153.0   94.0    NaT NaN
6   171544  2153-09-05 15:09:00 NaT NaN NaN NaN 2153-09-05 15:09:00 some text4
7   171544  2153-09-05 17:43:00 NaT NaN NaN NaN 2153-09-05 17:43:00 some text5
8   171544  2153-09-06 10:36:00 NaT NaN NaN NaN 2153-09-06 10:36:00 some text6
9   171544  2153-09-06 14:11:00 2153-09-06 14:11:00 80.0    114.0   50.0    NaT NaN
10  171544  2153-09-06 15:55:00 NaT NaN NaN NaN 2153-09-06 15:55:00 some text7
11  171544  2153-09-06 17:12:00 NaT NaN NaN NaN 2153-09-06 17:12:00 some text8
12  171544  2153-09-06 17:30:00 2153-09-06 17:30:00 80.0    114.0   50.0    NaT NaN
13  171544  2153-09-06 17:35:00 2153-09-06 17:35:00 80.0    114.0   50.0    NaT NaN
14  171544  2153-09-06 17:40:00 2153-09-06 17:40:00 76.0    115.0   51.0    NaT NaN
15  171544  2153-09-06 17:45:00 2153-09-06 17:45:00 79.0    117.0   53.0    NaT NaN
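I actually typed this result dataframe by hand; I would like to generate it with pandas. Eventually, I will drop ce_charttime and ne_charttime, keep only the newly created charttime column, and fill in the missing values appropriately later. Any help is appreciated. Please let me know if more information is needed.

Thanks.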

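"Eventually, I will drop ce_charttime and ne_charttime and keep only the newly created charttime"

You can do that before joining the two dataframes, and then use the pandas concat function to append them into a single dataframe: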

import pandas as pd

# here I'm assuming both dataframes share a column `charttime` on the same axis
# (i.e. ce_charttime / ne_charttime were already renamed to charttime).
# parse_dates takes the names of the columns to convert to datetime64.
data1 = pd.read_csv('data1.csv', parse_dates=['charttime'])
data2 = pd.read_csv('data2.csv', parse_dates=['charttime'])

print(data1.head(10), end='\n\n')
print(data2.head(10), end='\n\n')

data = pd.concat([data1, data2],  axis=0, sort=True)
data.sort_values(by=['charttime'], inplace=True)
data.reset_index(drop=True, inplace=True)
print(data.head(20))
The code above produces the following output:

   hadm_id            charttime    hr    sbp   dbp
0   140694  2121-08-12 19:00:00  67.0  102.0  75.0
1   140694  2121-08-12 19:45:00  68.0  135.0  68.0
2   140694  2121-08-12 20:00:00  70.0  153.0  94.0
3   171544  2153-09-06 14:11:00  80.0  114.0  50.0
4   171544  2153-09-06 17:30:00  80.0  114.0  50.0
5   171544  2153-09-06 17:35:00  80.0  114.0  50.0
6   171544  2153-09-06 17:40:00  76.0  115.0  51.0
7   171544  2153-09-06 17:45:00  79.0  117.0  53.0

   hadm_id            charttime        note
0   140694  2121-08-10 20:32:00  some text1
1   140694  2121-08-11 12:57:00  some text2
2   140694  2121-08-11 15:18:00  some text3
3   171544  2153-09-05 15:09:00  some text4
4   171544  2153-09-05 17:43:00  some text5
5   171544  2153-09-06 10:36:00  some text6
6   171544  2153-09-06 15:55:00  some text7
7   171544  2153-09-06 17:12:00  some text8

              charttime   dbp  hadm_id    hr        note    sbp
0   2121-08-10 20:32:00   NaN   140694   NaN  some text1    NaN
1   2121-08-11 12:57:00   NaN   140694   NaN  some text2    NaN
2   2121-08-11 15:18:00   NaN   140694   NaN  some text3    NaN
3   2121-08-12 19:00:00  75.0   140694  67.0         NaN  102.0
4   2121-08-12 19:45:00  68.0   140694  68.0         NaN  135.0
5   2121-08-12 20:00:00  94.0   140694  70.0         NaN  153.0
6   2153-09-05 15:09:00   NaN   171544   NaN  some text4    NaN
7   2153-09-05 17:43:00   NaN   171544   NaN  some text5    NaN
8   2153-09-06 10:36:00   NaN   171544   NaN  some text6    NaN
9   2153-09-06 14:11:00  50.0   171544  80.0         NaN  114.0
10  2153-09-06 15:55:00   NaN   171544   NaN  some text7    NaN
11  2153-09-06 17:12:00   NaN   171544   NaN  some text8    NaN
12  2153-09-06 17:30:00  50.0   171544  80.0         NaN  114.0
13  2153-09-06 17:35:00  50.0   171544  80.0         NaN  114.0
14  2153-09-06 17:40:00  51.0   171544  76.0         NaN  115.0
15  2153-09-06 17:45:00  53.0   171544  79.0         NaN  117.0
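
If the frames still carry the original ce_charttime and ne_charttime columns, the renaming step could look roughly like the sketch below. It is only an illustration: the two small frames are stand-ins built from a few of the sample rows, the names ds, dn, and note are taken from the question, and sorting on hadm_id before charttime is my own assumption about how the rows should be ordered.

```python
import pandas as pd

# Stand-ins for the question's frames, using a few of the sample rows.
ds = pd.DataFrame({
    "hadm_id": [140694, 171544],
    "ce_charttime": pd.to_datetime(["2121-08-12 19:00:00", "2153-09-06 14:11:00"]),
    "hr": [67.0, 80.0],
    "sbp": [102.0, 114.0],
    "dbp": [75.0, 50.0],
})
dn = pd.DataFrame({
    "hadm_id": [140694, 171544],
    "ne_charttime": pd.to_datetime(["2121-08-10 20:32:00", "2153-09-05 15:09:00"]),
    "note": ["some text1", "some text4"],
})

# Give both frames the same time column name before stacking them.
ds = ds.rename(columns={"ce_charttime": "charttime"})
dn = dn.rename(columns={"ne_charttime": "charttime"})

# Stack the rows, then order them by encounter and by time within each encounter.
combined = (
    pd.concat([ds, dn], axis=0, sort=False)
    .sort_values(["hadm_id", "charttime"])
    .reset_index(drop=True)
)
print(combined)
```

Columns that exist in only one frame (hr, sbp, dbp, note) are filled with NaN for the rows coming from the other frame.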

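Thanks. Two questions: 1) I notice this code does not group by hadm_id. Will every operation be applied correctly per encounter? There may be encounters that occur at the same time but have different hadm_id values. 2) What happens if both types of data are recorded at the same time?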
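
Both questions can be checked on a toy example. The sketch below is not from the original answer: the frames vitals and notes are made up, and the outer merge is an alternative I am suggesting, not something the answer used. It shows that sorting on both hadm_id and charttime keeps encounters apart even when their times coincide, that concat leaves simultaneous records as two separate rows, and that an outer merge on both keys puts them on one row.

```python
import pandas as pd

# Hypothetical data: encounters 1 and 2 share a timestamp, and
# encounter 1 has a vitals row and a note recorded at the same time.
vitals = pd.DataFrame({
    "hadm_id": [1, 2],
    "charttime": pd.to_datetime(["2121-08-12 19:00:00", "2121-08-12 19:00:00"]),
    "hr": [67.0, 80.0],
})
notes = pd.DataFrame({
    "hadm_id": [1, 1],
    "charttime": pd.to_datetime(["2121-08-12 19:00:00", "2121-08-12 18:00:00"]),
    "note": ["same-time note", "earlier note"],
})

# concat + sort on both keys: rows are ordered per encounter,
# but the simultaneous vitals row and note stay on separate rows.
stacked = (
    pd.concat([vitals, notes])
    .sort_values(["hadm_id", "charttime"])
    .reset_index(drop=True)
)

# Outer merge on both keys: rows with the same encounter and time are combined.
merged = (
    vitals.merge(notes, on=["hadm_id", "charttime"], how="outer")
    .sort_values(["hadm_id", "charttime"])
    .reset_index(drop=True)
)

print(stacked)
print(merged)
```

One caveat with the merge variant: if either frame contains several rows with the same hadm_id and charttime, the merge pairs every such row with every matching row from the other frame, so the row count can grow.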