Python: how to merge two data frames based on time intervals and transform the result
I have two data frames: the first is created manually by users, and the second contains errors reported by a machine. I want to merge them based on the time intervals in the first data frame (df_a). Here are the data frames:
import pandas as pd

d_a = {'Station': ['A1', 'A2'],
       'Reason_a': ['Electronic', 'Feed'],
       'StartTime_a': ['2019-01-02 02:00:00', '2019-01-02 04:22:00'],
       'EndTime_a': ['2019-01-02 02:20:00', '2019-01-02 04:45:00']}
d_b = {'Station': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
       'Reason_b': ['a', 'n', 'c', 'd', 'e', 'n'],
       'StartTime_b': ['2019-01-02 00:00:00.000', '2019-01-02 00:05:00.000', '2019-01-01 23:55:00.000', '2019-01-02 04:19:53.000', '2019-01-02 04:19:37.000', '2019-01-02 04:23:00.000'],
       'EndTime_b': ['2019-01-02 00:19:15.000', '2019-01-02 00:29:45.000', '2019-01-02 00:12:12.000', '2019-01-02 04:27:12.000', '2019-01-02 04:47:16.000', '2019-01-02 04:52:45.000']}
df_a = pd.DataFrame(d_a)
df_b = pd.DataFrame(d_b)
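Any intersection between the time intervals of the two data frames counts as a valid record:
Condition 1 = the df_b interval starts after the df_a start time and ends before the df_a end time
Condition 2 = the df_b interval starts before the df_a start time but ends before the df_a end time
Condition 3 = the df_b interval starts between the df_a start time and the df_a end time but ends after the df_a end time
Finally, I want to merge the two data frames based on these conditions. My ideal table looks like this: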
Station  Reason_a    a  n  c  d  e
A1       Electronic  1  1  1  0  0
A2       Feed        0  1  0  1  0
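How should I approach this problem?
Any comments would be helpful.
Thanks in advance.

It is possible to do this kind of merge with pandas. Assuming 'Station' is an additional key for the merge, you can use the following: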
df_a['StartTime_a'] = pd.to_datetime(df_a['StartTime_a'])
df_b['StartTime_b'] = pd.to_datetime(df_b['StartTime_b'])
df_a['EndTime_a'] = pd.to_datetime(df_a['EndTime_a'])
df_b['EndTime_b'] = pd.to_datetime(df_b['EndTime_b'])
# merge_asof requires both frames to be sorted by the merge key
df_a.sort_values(by='StartTime_a', inplace=True)
df_b.sort_values(by='StartTime_b', inplace=True)
# Merge and filter based on the first condition
cond_1 = pd.merge_asof(df_a, df_b, by='Station', left_on='StartTime_a',
                       right_on='StartTime_b', direction='forward')
cond_1 = cond_1[cond_1['StartTime_b'] <= cond_1['EndTime_a']]
# Merge and filter based on the second condition
cond_2 = pd.merge_asof(df_a, df_b, by='Station', left_on='StartTime_a',
                       right_on='StartTime_b', direction='backward')
cond_2 = cond_2[cond_2['EndTime_b'] <= cond_2['EndTime_a']]
# Merge and filter based on the third condition
cond_3 = pd.merge_asof(df_a, df_b, by='Station', left_on='StartTime_a',
                       right_on='StartTime_b', direction='forward')
cond_3 = cond_3[cond_3['StartTime_b'] <= cond_3['EndTime_a']]
cond_3 = cond_3[cond_3['EndTime_b'] >= cond_3['EndTime_a']]
# Concatenate all matches
res_df = pd.concat([cond_1, cond_2, cond_3], sort=False)
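One thing to keep in mind with this approach: pd.merge_asof returns at most one right-hand match per left-hand row (the nearest one in the chosen direction), so several df_b rows falling into the same df_a interval are not all returned. The following is a minimal sketch of that behaviour on a three-row subset of the sample data; the names left, right and backward are made up for the illustration.

```python
import pandas as pd

# One df_a-style interval start for station A2.
left = pd.DataFrame({
    "Station": ["A2"],
    "StartTime_a": pd.to_datetime(["2019-01-02 04:22:00"]),
})

# Three df_b-style rows for the same station, already sorted by start time.
right = pd.DataFrame({
    "Station": ["A2", "A2", "A2"],
    "Reason_b": ["e", "d", "n"],
    "StartTime_b": pd.to_datetime([
        "2019-01-02 04:19:37",
        "2019-01-02 04:19:53",
        "2019-01-02 04:23:00",
    ]),
})

# direction="backward" picks the single nearest right row whose
# StartTime_b is at or before StartTime_a.
backward = pd.merge_asof(left, right, by="Station",
                         left_on="StartTime_a", right_on="StartTime_b",
                         direction="backward")
print(backward[["Station", "Reason_b", "StartTime_b"]])
```

Two df_b rows start before 04:22:00 here, but only the closest one is kept, so a result built this way may miss matches when intervals are dense.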

I came up with this:
df_c = pd.merge(df_a, df_b, on='Station')
Convert the time columns to datetime:
df_c['StartTime_a'] = pd.to_datetime(df_c['StartTime_a'])
df_c['StartTime_b'] = pd.to_datetime(df_c['StartTime_b'])
df_c['EndTime_a'] = pd.to_datetime(df_c['EndTime_a'])
df_c['EndTime_b'] = pd.to_datetime(df_c['EndTime_b'])
Apply a lambda function that returns 1 if any of the three conditions holds and 0 otherwise:
df_c['c'] = df_c.apply(
    lambda x: int(
        # Condition 1: b starts after a starts and ends before a ends
        ((x.StartTime_b > x.StartTime_a) and (x.EndTime_b < x.EndTime_a))
        # Condition 2: b starts before a starts and ends before a ends
        or ((x.StartTime_b < x.StartTime_a) and (x.EndTime_b < x.EndTime_a))
        # Condition 3: b starts inside a and ends after a ends
        or ((x.StartTime_b > x.StartTime_a) and (x.StartTime_b < x.EndTime_a)
            and (x.EndTime_b > x.EndTime_a))
    ),
    axis=1)
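The row-wise apply can be slow on large frames. As a rough sketch, the same three conditions can be written as vectorized boolean masks and then pivoted into the wide layout from the question. The names pairs, cond_1 to cond_3 and wide are chosen for this illustration, and the conditions are taken literally from the question.

```python
import pandas as pd

df_a = pd.DataFrame({
    "Station": ["A1", "A2"],
    "Reason_a": ["Electronic", "Feed"],
    "StartTime_a": ["2019-01-02 02:00:00", "2019-01-02 04:22:00"],
    "EndTime_a": ["2019-01-02 02:20:00", "2019-01-02 04:45:00"],
})
df_b = pd.DataFrame({
    "Station": ["A1", "A1", "A1", "A2", "A2", "A2"],
    "Reason_b": ["a", "n", "c", "d", "e", "n"],
    "StartTime_b": ["2019-01-02 00:00:00.000", "2019-01-02 00:05:00.000",
                    "2019-01-01 23:55:00.000", "2019-01-02 04:19:53.000",
                    "2019-01-02 04:19:37.000", "2019-01-02 04:23:00.000"],
    "EndTime_b": ["2019-01-02 00:19:15.000", "2019-01-02 00:29:45.000",
                  "2019-01-02 00:12:12.000", "2019-01-02 04:27:12.000",
                  "2019-01-02 04:47:16.000", "2019-01-02 04:52:45.000"],
})

# Pair every df_a row with every df_b row of the same station.
pairs = df_a.merge(df_b, on="Station")
for col in ["StartTime_a", "EndTime_a", "StartTime_b", "EndTime_b"]:
    pairs[col] = pd.to_datetime(pairs[col])

# The three conditions from the question, evaluated column-wise.
cond_1 = (pairs["StartTime_b"] > pairs["StartTime_a"]) & (pairs["EndTime_b"] < pairs["EndTime_a"])
cond_2 = (pairs["StartTime_b"] < pairs["StartTime_a"]) & (pairs["EndTime_b"] < pairs["EndTime_a"])
cond_3 = ((pairs["StartTime_b"] > pairs["StartTime_a"])
          & (pairs["StartTime_b"] < pairs["EndTime_a"])
          & (pairs["EndTime_b"] > pairs["EndTime_a"]))
pairs["c"] = (cond_1 | cond_2 | cond_3).astype(int)

# One row per (Station, Reason_a), one column per Reason_b;
# aggfunc="max" yields 1 if at least one pair matched.
wide = (pairs.pivot_table(index=["Station", "Reason_a"], columns="Reason_b",
                          values="c", aggfunc="max")
        .fillna(0).astype(int))
print(wide)
```

On the sample data this gives 1 for a, n and c on station A1 and for d and n on station A2, which agrees with the table in the question (the columns come out in alphabetical order rather than a, n, c, d, e).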

I would solve this by merging the tables on Station and computing the intersections :D
df = pd.merge(df_a, df_b, on="Station")
# Convert the time columns to datetime
for datevar in ["StartTime_a", "StartTime_b", "EndTime_a", "EndTime_b"]:
    df[datevar] = pd.to_datetime(df[datevar])
# Intersection definition (strict comparisons, so equal start times are not counted)
df["intersection"] = (((df.StartTime_a > df.StartTime_b) & (df.StartTime_a < df.EndTime_b)) |
                      ((df.StartTime_a < df.StartTime_b) & (df.EndTime_a > df.StartTime_b)))
# Count intersections per (Station, Reason_a) row and Reason_b column;
# aggfunc="sum" replaces np.sum, which newer pandas versions warn about
result = (df[["Station", "Reason_a", "Reason_b", "intersection"]]
          .pivot_table(index=["Station", "Reason_a"], columns="Reason_b",
                       values="intersection", aggfunc="sum")
          .fillna(0).astype(int))
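For reference, a common way to test whether two intervals truly overlap is to check that each one starts before the other ends. The sketch below is an illustration of that predicate, not the answerer's code; intervals_overlap and demo are names invented here. Note that a true-overlap test is not the same thing as the three literal conditions in the question, so the two can give different flags on the same rows.

```python
import pandas as pd

def intervals_overlap(start_a, end_a, start_b, end_b):
    """Element-wise strict overlap test: each interval starts before the other ends."""
    return (start_a < end_b) & (start_b < end_a)

# One df_a-style interval (04:22 to 04:45) compared against three df_b-style intervals.
demo = pd.DataFrame({
    "StartTime_a": pd.to_datetime(["2019-01-02 04:22:00"] * 3),
    "EndTime_a": pd.to_datetime(["2019-01-02 04:45:00"] * 3),
    "StartTime_b": pd.to_datetime(["2019-01-02 04:19:53",
                                   "2019-01-02 04:19:37",
                                   "2019-01-02 00:00:00"]),
    "EndTime_b": pd.to_datetime(["2019-01-02 04:27:12",
                                 "2019-01-02 04:47:16",
                                 "2019-01-02 00:19:15"]),
})

demo["overlap"] = intervals_overlap(demo["StartTime_a"], demo["EndTime_a"],
                                    demo["StartTime_b"], demo["EndTime_b"])
print(demo["overlap"].tolist())
```

The second row (a df_b interval that starts before and ends after the df_a interval) counts as an overlap here, while none of the three conditions in the question covers that case.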

If you want to avoid the merge, you can do the following:
df_a['StartTime_a'] = pd.to_datetime(df_a['StartTime_a'])
df_b['StartTime_b'] = pd.to_datetime(df_b['StartTime_b'])
df_a['EndTime_a'] = pd.to_datetime(df_a['EndTime_a'])
df_b['EndTime_b'] = pd.to_datetime(df_b['EndTime_b'])
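def check_condition(x):
    # Note: each df_b row is compared against every df_a row, regardless of Station
    # Condition 1: b starts after a starts and ends before a ends
    df_1 = df_a[(df_a['StartTime_a'] < x.StartTime_b) & (df_a['EndTime_a'] > x.EndTime_b)]
    # Condition 2: b starts before a starts and ends before a ends
    df_2 = df_a[(df_a['StartTime_a'] > x.StartTime_b) & (df_a['EndTime_a'] > x.EndTime_b)]
    # Condition 3: b starts inside a and ends after a ends
    df_3 = df_a[(df_a['StartTime_a'] < x.StartTime_b) & (df_a['EndTime_a'] > x.StartTime_b)
                & (df_a['EndTime_a'] < x.EndTime_b)]
    if df_1.shape[0] + df_2.shape[0] + df_3.shape[0] != 0:
        return 1
    else:
        return 0

df_b['c'] = df_b.apply(check_condition, axis=1)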

Comments:
Hi, yes, your solution works, but unfortunately my data is much larger: df_b has about 1.5 million rows and df_a about 50k, so I get a memory error at the merge step. Do you have any other ideas? Thanks for everything!
If you get a memory error, I would compute it in batches. I mean running the code I suggested on a subset of stations each time.
Thanks for your answer, but
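
The batching idea from the comments could look roughly like the sketch below: merge one station at a time, keep only the matching rows, and concatenate the small results. The function name match_in_batches is made up for this example, the three conditions are taken literally from the question, and the time columns are assumed to be datetime already.

```python
import pandas as pd

def match_in_batches(df_a, df_b):
    """Return matching (Station, Reason_a, Reason_b) rows, one station at a time.

    Only one station's pairs are held in memory at once. Assumes the four
    time columns are already datetime64.
    """
    b_by_station = {station: part for station, part in df_b.groupby("Station")}
    parts = []
    for station, a_part in df_a.groupby("Station"):
        b_part = b_by_station.get(station)
        if b_part is None:
            continue  # no machine errors recorded for this station
        pairs = a_part.merge(b_part, on="Station")
        cond_1 = (pairs["StartTime_b"] > pairs["StartTime_a"]) & (pairs["EndTime_b"] < pairs["EndTime_a"])
        cond_2 = (pairs["StartTime_b"] < pairs["StartTime_a"]) & (pairs["EndTime_b"] < pairs["EndTime_a"])
        cond_3 = ((pairs["StartTime_b"] > pairs["StartTime_a"])
                  & (pairs["StartTime_b"] < pairs["EndTime_a"])
                  & (pairs["EndTime_b"] > pairs["EndTime_a"]))
        matched = cond_1 | cond_2 | cond_3
        parts.append(pairs.loc[matched, ["Station", "Reason_a", "Reason_b"]])
    if not parts:
        return pd.DataFrame(columns=["Station", "Reason_a", "Reason_b"])
    return pd.concat(parts, ignore_index=True)

# Usage on the sample data from the question.
df_a = pd.DataFrame({
    "Station": ["A1", "A2"],
    "Reason_a": ["Electronic", "Feed"],
    "StartTime_a": pd.to_datetime(["2019-01-02 02:00:00", "2019-01-02 04:22:00"]),
    "EndTime_a": pd.to_datetime(["2019-01-02 02:20:00", "2019-01-02 04:45:00"]),
})
df_b = pd.DataFrame({
    "Station": ["A1", "A1", "A1", "A2", "A2", "A2"],
    "Reason_b": ["a", "n", "c", "d", "e", "n"],
    "StartTime_b": pd.to_datetime(["2019-01-02 00:00:00", "2019-01-02 00:05:00",
                                   "2019-01-01 23:55:00", "2019-01-02 04:19:53",
                                   "2019-01-02 04:19:37", "2019-01-02 04:23:00"]),
    "EndTime_b": pd.to_datetime(["2019-01-02 00:19:15", "2019-01-02 00:29:45",
                                 "2019-01-02 00:12:12", "2019-01-02 04:27:12",
                                 "2019-01-02 04:47:16", "2019-01-02 04:52:45"]),
})
matches = match_in_batches(df_a, df_b)
print(matches)
```

Whether this is enough depends on how the rows are distributed: a single station with very many rows on both sides would still produce a large intermediate frame, in which case the station's df_a rows could be split into smaller chunks as well.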