Fastest way to perform a complex search on a pandas DataFrame

I'm trying to figure out the fastest way to perform a search and sort on a pandas DataFrame. Below are the before and after DataFrames of what I'm trying to accomplish.

Before:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
After the search/sort:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
In this example I'm essentially trying to filter out the "flights" that exist between destinations. This should be doable with some kind of drop_duplicates method, but what confuses me is how to handle all the columns. Would a binary search be the best way to achieve this? I'd appreciate any help figuring this out.
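For reference, the "before" frame above can be reproduced with a few lines of pandas (column names taken directly from the tables in the question):

```python
import pandas as pd

# The "before" DataFrame from the question, for experimentation
df = pd.DataFrame({
    'flightTo':   ['ABC', 'DEF', 'AAA', 'BBB'],
    'flightFrom': ['DEF', 'XYZ', 'BBB', 'CCC'],
    'toNum':      [123, 456, 473, 917],
    'fromNum':    [456, 893, 917, 341],
    'toCode':     [8000, 9999, 5555, 5555],
    'fromCode':   [8000, 9999, 5555, 5555],
})
print(df.shape)  # (4, 6)
```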
Possible edge case:

What if the data were switched, so that our end connections appear in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
After the search/sort:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
Logically, this shouldn't happen: after all, how could you fly both DEF-ABC and DEF-XYZ? You can't, but the "endpoints" would still be ABC-XYZ.

This is a network problem, so we use networkx. Note that here you can have more than two stops, which means you could have something like NY-DC-WA-NC:
import networkx as nx

# Create the nx graph object from the pandas DataFrame
G = nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# Get the list of connected components: airports that are tied
# to each other, i.e. linked in the network graph
l = list(nx.connected_components(G))
# From that we can build our mapping dict: since all airports in a
# component are connected to each other, we pick one group id per
# component and map every member to it
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]
d = {k: v for d in L for k, v in d.items()}
# Build the dict for groupby: we need the _to columns as the first
# item of each group and the _from columns as the last
grouppd = dict(zip(df.columns.tolist(), ['first', 'last'] * 3))
df.groupby(df.flightTo.map(d)).agg(grouppd)  # agg with a dict yields the output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
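As a sketch of why connected components also handle longer chains, here is the same pipeline on a hypothetical 3-leg itinerary NY-DC-WA-NC (data invented for illustration; the numeric columns are placeholders). Note that the 'first'/'last' aggregation assumes the legs appear in travel order:

```python
import pandas as pd
import networkx as nx

# Hypothetical 3-leg itinerary NY -> DC -> WA -> NC
df = pd.DataFrame({
    'flightTo':   ['NY', 'DC', 'WA'],
    'flightFrom': ['DC', 'WA', 'NC'],
    'toNum':      [1, 2, 3],
    'fromNum':    [2, 3, 4],
    'toCode':     [10, 20, 30],
    'fromCode':   [20, 30, 40],
})

G = nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
comps = list(nx.connected_components(G))  # one component: {NY, DC, WA, NC}
d = {node: i for i, comp in enumerate(comps) for node in comp}
agg = dict(zip(df.columns, ['first', 'last'] * 3))
out = df.groupby(df.flightTo.map(d)).agg(agg)
# All three legs collapse into a single NY -> NC row
print(out)
```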
Install networkx:
- pip: pip install networkx
- Anaconda: conda install -c anaconda networkx
Here is a NumPy solution, which may come in handy in case performance is relevant:
import numpy as np

def remove_middle_dest(df):
    x = df.to_numpy()
    # Obtain a flat numpy array from both airport columns
    b = x[:, 0:2].ravel()
    _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
    # Indices of duplicate values in b (the middle destinations)
    ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
    # Indices to be used to replace the content in the columns
    replace_at = (inv[:, None] == inv[ixs_drop]).argmax(0)
    # Column index of where the duplicate value is, 0 or 1
    col = (ixs_drop % 2) ^ 1
    # 2d array to index and replace values in the df;
    # index to obtain the values with which to replace
    keep_cols = np.broadcast_to([3, 5], (len(col), 2))
    ixs = np.concatenate([col[:, None], keep_cols], 1)
    # Translate flat indices into row indices
    rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
    c = np.empty((len(col), 5), dtype=x.dtype)
    c[:, ::2] = x[rows_drop[:, None], ixs]
    c[:, 1::2] = x[rows_replace[:, None], [2, 4]]
    # Update the DataFrame and drop the now-redundant rows
    df.iloc[rows_replace, 1:] = c
    return df.drop(rows_drop)
Which, for the proposed DataFrame, produces the expected output:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 DEF XYZ 456 893 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
This approach does not assume any particular order in terms of the row where the duplicate appears, and the same applies to the columns (to cover the edge case described in the question). If, for instance, we use the following DataFrame:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
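The core trick above is np.unique with return_index/return_inverse on the flattened endpoint array: any position that is not a first occurrence holds a repeated endpoint, i.e. a middle destination. A minimal trace on the question's data:

```python
import numpy as np

# Flattened flightTo/flightFrom values, row-major: (to, from, to, from, ...)
b = np.array(['ABC', 'DEF', 'DEF', 'XYZ', 'AAA', 'BBB', 'BBB', 'CCC'])
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# ix holds the first occurrence of each unique value; any index
# missing from ix is a repeated endpoint (a middle destination)
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
print(ixs_drop)   # [2 6]: the repeated 'DEF' and 'BBB'
# Integer-dividing by 2 maps flat positions back to DataFrame rows
rows_drop = ixs_drop // 2
print(rows_drop)  # [1 3]: the rows that get dropped
```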
Comments:
- Are connecting flights always adjacent in the DataFrame?
- What about np.where(condition), df['flightFrom'], df['flightTo']? @Mike that information is in the DataFrame.
- @IanS check the fromNum, fromCode values in the expected output; that's what makes this question complex.
- Great answer! I've looked at networkx a few times and will now do so more. @Erfan it's more like chained keys.
- This answer should be broken down with more explanation :) (so I can learn from it). The best answer I've read. Could you edit the variables to use informative names instead of single letters, and expand the solution? Or better, write an article on Medium (or elsewhere) explaining this methodology.
- @MaxB all I can say is: split your DataFrame into two, one the normal network and the other your edge case, using df1 = df[df.ID.duplicated(keep=False)]; df2 = df.drop(df1.index); then df1.groupby('flightFrom').agg(...), and df2 follows the steps above.
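The split suggested in that last comment might be sketched as follows. This is a hypothetical minimal frame: the comment's ID column does not exist in the sample data, so flight2 is used here as the shared-stop column:

```python
import pandas as pd

# Hypothetical data: two edge-case legs sharing a destination, one normal leg
df = pd.DataFrame({
    'flight1': ['ABC', 'XYZ', 'AAA'],
    'flight2': ['DEF', 'DEF', 'BBB'],
})
# Rows whose shared stop appears more than once are the edge case
mask = df['flight2'].duplicated(keep=False)
df1 = df[mask]            # edge-case rows (shared destination in one column)
df2 = df.drop(df1.index)  # normal network rows, handled by the steps above
print(len(df1), len(df2))  # 2 1
```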