Pandas: iterating over two large DataFrames to pull values. Is there a vectorized way?
I have a DataFrame with 1 million rows that looks like this:
shipname timestamp
0 11/1/2019
0 11/2/2019
... ...
100 10/1/2018
I have a second DataFrame that holds a series of date ranges, as shown below:
shipname dateorigin datedestination
0 10/1/2019 10/5/2019
0 10/20/2019 11/10/2019
...
99 11/1/2019 11/20/2019
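For each row of the first DataFrame, I want to run a function that returns the index in DF2 if the shipname is in DataFrame 2 and the timestamp falls between dateorigin and datedestination.
I am currently using df.iterrows to do this, but it slows my PC down and makes Python nearly unusable. Also, in some cases more than one row in DF2 can satisfy the condition (in which case I only want the first one returned). So far I have been using this code: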
for t in shipbase.itertuples():
    try:
        idx = (t.shipname == df.shipname) & (t.Timestamp >= df.DateOrigin) & (
            t.Timestamp <= df.DateDestination)
        list_index.append(df.loc[idx].index.values)
    except ValueError:
        list_index.append(np.nan)
    print(t)
Any help getting this code to run faster would be greatly appreciated. I have been trying to vectorize it, but can't think of an easy solution.
If you don't run out of memory, you can try the following:
df = pd.merge(df1, df2, how='inner', on='shipname')
# If the merge succeeds but you run out of memory afterwards, free df1 and df2 with:
# del df1, df2
df = df[df['timestamp'].between(df['dateorigin'], df['datedestination'])]
Note that pd.merge can duplicate some rows, because the shipname values do not appear to be unique in either DataFrame. See merge and Series.between.
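To get back what the question asks for (the DF2 index of the first match per row, or NaN when there is none), the merge-and-filter idea can be taken one step further. The following is only a minimal sketch under some assumptions: the sample data is made up to mirror the tables in the question, the date columns are converted with pd.to_datetime so that the comparison is chronological rather than textual, the helper column names df1_idx and df2_idx are hypothetical, and "first" is taken to mean first in DF2's row order, which an inner merge is expected to preserve within each key.

```python
import pandas as pd

# Hypothetical sample data mirroring the tables in the question.
df1 = pd.DataFrame({
    "shipname": [0, 0, 100],
    "timestamp": pd.to_datetime(["11/1/2019", "11/2/2019", "10/1/2018"]),
})
df2 = pd.DataFrame({
    "shipname": [0, 0, 99],
    "dateorigin": pd.to_datetime(["10/1/2019", "10/20/2019", "11/1/2019"]),
    "datedestination": pd.to_datetime(["10/5/2019", "11/10/2019", "11/20/2019"]),
})

# Turn both original indexes into ordinary columns so they survive the merge.
# The names df1_idx and df2_idx are illustrative, not part of the original code.
left = df1.rename_axis("df1_idx").reset_index()
right = df2.rename_axis("df2_idx").reset_index()

# Pair every df1 row with every df2 row for the same ship.
merged = left.merge(right, on="shipname", how="inner")

# Keep only the pairs whose timestamp lies inside the date range (inclusive).
in_range = merged["timestamp"].between(
    merged["dateorigin"], merged["datedestination"]
)

# One value per df1 row: the first matching df2 index.
first_match = merged.loc[in_range].groupby("df1_idx")["df2_idx"].first()

# Align back to df1; rows without any match become NaN.
df1["df2_idx"] = first_match.reindex(df1.index)
print(df1)
```

Because the intermediate merged frame holds one row per (DF1 row, DF2 row) pair for each ship, its size grows with how often each shipname repeats in both frames, so the memory caveat above still applies.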