Python 3.x: Identify pairs of rows in a DataFrame that satisfy three conditions

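I have the df below and want to identify any two orders that satisfy all of the following conditions:

- The distance between the pickups is less than X miles
- The distance between the dropoffs is less than Y miles
- The difference between the order creation times is less than Z minutes

I would use from haversine import haversine to compute the pickup distance and the dropoff distance between each pair of rows (orders).

My current df looks like this: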

  DAY   Order  pickup_lat  pickup_long     dropoff_lat dropoff_long  created_time
 1/3/19  234e    32.69        -117.1          32.63      -117.08   3/1/19 19:00
 1/3/19  235d    40.73        -73.98          40.73       -73.99   3/1/19 23:21
 1/3/19  253w    40.76        -73.99          40.76       -73.99   3/1/19 15:26
 2/3/19  231y    36.08        -94.2           36.07       -94.21   3/2/19 0:14
 3/3/19  305g    36.01        -78.92          36.01       -78.95   3/2/19 0:09
 3/3/19  328s    36.76        -119.83         36.74       -119.79  3/2/19 4:33
 3/3/19  286n    35.76        -78.78          35.78       -78.74   3/2/19 0:43
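
I want my output df to contain any 2 orders (rows) that satisfy the conditions above. What I am not sure about is how to run the calculation across every row of the DataFrame and return any two rows that meet these conditions.

I hope I have explained my desired output correctly. Thanks for looking.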

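I don't know whether this is an optimal solution, but I couldn't come up with anything different. What I did:

- Created a DataFrame with all possible combinations of two orders,
- Computed all the required measures for every combination and added them to the DataFrame as columns,
- Found the indices of the rows that satisfy the conditions above.

The code: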

# Create a DataFrame with all combinations of two orders
from itertools import combinations

import pandas as pd
from haversine import haversine

# trips is your DataFrame
index_comb = list(combinations(trips.index, 2))
orders1 = pd.DataFrame([trips.loc[c[0], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = pd.DataFrame([trips.loc[c[1], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1, orders2], axis=1)

def distance(row):
    # Each point is (lat, lon), so the column lists below must be ordered lat, long, lat, long
    loc_0 = (row.iloc[0], row.iloc[1])
    loc_1 = (row.iloc[2], row.iloc[3])
    return haversine(loc_0, loc_1, unit='mi')

# Pickup distance in miles
pickup_cols = ["pickup_lat", "pickup_long", "pickup_lat_1", "pickup_long_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance, axis=1)

# Dropoff distance in miles
dropoff_cols = ["dropoff_lat", "dropoff_long", "dropoff_lat_1", "dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance, axis=1)

# Creation time difference in minutes
combined["time_diff_min"] = (pd.to_datetime(combined["created_time"]) - pd.to_datetime(combined["created_time_1"])).abs().dt.total_seconds() / 60

# Thresholds
Z = 600
Y = 400
X = 400

# Find the order pairs that satisfy all three conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X = combined["pickup_dist_mi"] < X
dropoff_dist_Y = combined["dropoff_dist_mi"] < Y
conditions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[conditions_idx, ["Order", "Order_1", "time_diff_min", "dropoff_dist_mi", "pickup_dist_mi"]]

I hope I understood you correctly and that this helps.
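
For readers who want to see what the distance step computes without installing the haversine package, here is a minimal standard-library sketch of the haversine formula. The function name haversine_miles and the Earth radius constant are assumptions made for illustration, not part of the answer above; the two points are the pickup coordinates of orders 305g and 286n from the question's sample data.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles (assumed constant)

def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    # Clamp to 1.0 so floating-point rounding cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_MI * asin(min(1.0, sqrt(a)))

# Pickup points of orders 305g and 286n, as (lat, lon).
d = haversine_miles((36.01, -78.92), (35.76, -78.78))
print(round(d, 1))
```

Note that the points are passed as (lat, lon), the same order the haversine package expects.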

Use your DataFrame as shown above, with the index dropped. I am assuming your created-time column is in datetime format.

import pandas as pd
from geopy.distance import geodesic

Cross merge the DataFrame with itself to get all possible combinations of 'Order':

df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)

Remove all rows where the two orders are equal:

df_all = df_all[df_all['Order_x'] != df_all['Order_y']].copy()

Remove duplicate rows where Order_x, Order_y == [a, b] and [b, a]:

# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, then join it into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))

# drop the duplicates (reset_index returns a new DataFrame, so assign it back if you want the index renumbered)
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
df_all.reset_index(drop=True)

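Create a column with the time difference in minutes:

# created_time_x / created_time_y are the merged copies of the question's created_time column (assumed to be datetime)
df_all['time'] = (df_all['created_time_x'] - df_all['created_time_y']).abs().dt.total_seconds() / 60
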
Create a column with the distance between the dropoffs:

df_all['dropoff'] = df_all.apply(
    (lambda row: geodesic(
        (row['dropoff_lat_x'], row['dropoff_long_x']),
        (row['dropoff_lat_y'], row['dropoff_long_y'])
    ).miles),
    axis=1
)

Create a column with the distance between the pickups:

df_all['pickup'] = df_all.apply(
    (lambda row: geodesic(
        (row['pickup_lat_x'], row['pickup_long_x']),
        (row['pickup_lat_y'], row['pickup_long_y'])
    ).miles),
    axis=1
)

Filter the results as needed:

X = 1500
Y = 2000
Z = 100

mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z

print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])

Output reported in the answer. These figures come from a run that used the _x latitude for both points of each pair, so distances computed from each order's own coordinates will differ somewhat:

   Order_x Order_y  time      dropoff       pickup
10    235d    231y  53.0  1059.026620  1059.026620
11    235d    305g  48.0   260.325370   259.275948
13    235d    286n  82.0   249.306279   251.929905
25    231y    305g   5.0   853.308110   854.315567
27    231y    286n  29.0   865.026077   862.126593
34    305g    286n  34.0    11.763787     7.842526
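
The pairing steps of this answer can probably be shortened on recent pandas versions. The sketch below assumes pandas 1.2 or later (where merge accepts how="cross") and assumes that Order values are unique; it uses three orders from the question's sample data and is only an illustration, not the answerer's code.

```python
import pandas as pd

# Three orders taken from the question's sample data.
df = pd.DataFrame({
    "Order": ["305g", "328s", "286n"],
    "pickup_lat": [36.01, 36.76, 35.76],
    "pickup_long": [-78.92, -119.83, -78.78],
})

# how="cross" builds the Cartesian product directly, with no helper key column.
pairs = df.merge(df, how="cross", suffixes=("_x", "_y"))

# Keeping only Order_x < Order_y drops self-pairs and mirrored (b, a) pairs in one step.
pairs = pairs[pairs["Order_x"] < pairs["Order_y"]].reset_index(drop=True)
print(pairs[["Order_x", "Order_y"]])
```

For n orders this leaves n * (n - 1) / 2 rows, the same set of unordered pairs that the dup_order column produces, without building and sorting a string per row.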

Comments:

Thanks! When I run the code, my notebook keeps running on this line and then the kernel fails. Is there anything wrong with this code? index_comb = list(combinations(df.index, 2)) @rafzy15

Is there any workaround so that the kernel doesn't fail when running that line? Maybe it's because my df is larger than the sample df?

Hey, I don't have any problem with this code, so I guess it may be memory related. Have you tried the same data but a smaller amount (for example, only half of the dataset)? I suggest changing the code to use the generator combinations(trips.index, 2) instead of creating a list. I'll look at this later.

Yes, it's probably memory related, because the dataset is about 500k rows. Would changing the code that way help? I'm trying it now! Thanks for your reply @Rafzy15

It got past that line with the edit from your comment, but now it seems to be stuck here: orders1 = pd.DataFrame([df.loc[c[0], :].values for c in index_comb], columns=df.columns, index=index_comb). Sorry to bother you @rafzy15

I tried both solutions, but my notebook kernel keeps crashing when I run the two lines of code that compute the combinations. Is there any way around this?
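
With about 500k rows there are roughly 1.25 * 10^11 unordered pairs, so any approach that materializes every combination is likely to exhaust memory. One possible workaround, sketched below under the assumption that the time threshold Z is small compared with the time span of the data, is to sort the orders by creation time and only pair each order with earlier orders that are still inside the Z-minute window. The function name pairs_within_window is hypothetical; the sample orders and times come from the question.

```python
from collections import deque
from datetime import datetime

def pairs_within_window(orders, max_minutes):
    """Yield (earlier_id, later_id) pairs created less than max_minutes apart.

    orders is an iterable of (order_id, created_time) tuples.
    """
    window = deque()  # earlier orders that can still pair with the current one
    for order_id, created in sorted(orders, key=lambda o: o[1]):
        # Orders too old for this one are also too old for every later one.
        while window and (created - window[0][1]).total_seconds() / 60 >= max_minutes:
            window.popleft()
        for other_id, _ in window:
            yield other_id, order_id
        window.append((order_id, created))

orders = [
    ("234e", datetime(2019, 3, 1, 19, 0)),
    ("235d", datetime(2019, 3, 1, 23, 21)),
    ("253w", datetime(2019, 3, 1, 15, 26)),
    ("231y", datetime(2019, 3, 2, 0, 14)),
    ("305g", datetime(2019, 3, 2, 0, 9)),
    ("328s", datetime(2019, 3, 2, 4, 33)),
    ("286n", datetime(2019, 3, 2, 0, 43)),
]

# Only these candidate pairs need the pickup and dropoff distance checks.
candidates = list(pairs_within_window(orders, 100))
print(candidates)
```

Because the function is a generator, candidate pairs can be checked against the distance thresholds one at a time instead of being stored. If many orders fall inside every window, the number of candidates can still be very large.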