Python: how to quickly compare very large DataFrames
I have two DataFrames, df1 and df2, each with tens of millions of rows; df1 is the main one. Both have two columns: date and code. After deduplication there are about 5,000 distinct dates, and each date + code combination is unique within a frame. What is the fastest way to find the rows whose date + code exists in df1 but not in df2?
My current approach loops over the dates and runs np.setdiff1d per date, which handles only about 3 dates per second. How can I optimize it?
My code is as follows:
import pandas as pd
import numpy as np
from tqdm import tqdm
......
df1['s'] = df1.date + df1.code
df2['s'] = df2.date + df2.code
df = pd.DataFrame()
dates = df1.date.drop_duplicates().sort_values(ascending=False)
for date in tqdm(dates, total=len(dates)):
    t1 = df1[df1.date == date]
    t2 = df2[df2.date == date]          # filter df2 by its own date column
    s = np.setdiff1d(t1.s, t2.s)
    if len(s) == 0:
        continue
    t = pd.DataFrame(s)
    df = pd.concat([df, t])             # concat returns a new frame; assign it
print(f'len of df1: {len(df1)}')
print(f'len of df2: {len(df2)}')
print(f'columns of df1: {df1.columns}')
print(f'columns of df2: {df2.columns}')
print(f'df1 len of codes drop_duplicates: {len(df1.code.drop_duplicates())}')
print(f'df2 len of codes drop_duplicates: {len(df2.code.drop_duplicates())}')
print(f'df1 len of dates drop_duplicates: {len(df1.date.drop_duplicates())}')
print(f'df2 len of dates drop_duplicates: {len(df2.date.drop_duplicates())}')
print(df1.head(10))
print(df2.head(10))
Calling np.setdiff1d directly on the whole key column might be faster, but the estimated total time is still too long:
import pandas as pd
import numpy as np
from tqdm import tqdm
......
df1['s'] = df1.date + df1.code
df2['s'] = df2.date + df2.code
s = np.setdiff1d(df1.s, df2.s)
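An alternative to np.setdiff1d is a single vectorized `isin` test on the key column, which avoids both the per-date loop and the sorting that setdiff1d performs internally. A minimal sketch on made-up sample rows (the actual date/code values in the real data are unknown, so these are placeholders):

```python
import pandas as pd

# Hypothetical tiny frames standing in for the real tens-of-millions-row data.
df1 = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
                    'code': ['A', 'B', 'A']})
df2 = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'],
                    'code': ['A', 'A']})

# Build the composite key once per frame, then keep the rows of df1
# whose key is absent from df2 -- one vectorized pass, no Python loop.
key1 = df1['date'] + df1['code']
key2 = df2['date'] + df2['code']
missing = df1[~key1.isin(key2)]
print(missing)  # only the ('2021-01-01', 'B') row remains
```

`isin` hashes key2 once and probes each element of key1 against it, so the cost is roughly linear in the combined size of the two key columns.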
Thanks to @Andrej Kesely, df.merge turned out to be very fast:
from time import time

df1['a'] = 1          # marker columns: a non-null value per source frame
df2['b'] = 1
print('start merge')
start = time()
df = pd.merge(df1, df2, how='outer', on=['date', 'code'])
stop = time()
print(f'merge time: {stop - start} s')
# a row with a null marker exists in only one of the two frames
df = df[df.a.isnull() | df.b.isnull()]
print('start saving df')
start = time()
df.to_csv('df.csv', index=False)
stop = time()
print(f'saving time: {stop - start} s')
The result is:
start merge
merge time: 11.73848009109497 s
start saving df
saving time: 0.3139007091522217 s
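The same left-anti join can also be written without the helper `a`/`b` marker columns by using merge's built-in `indicator` parameter, which labels each output row with its origin. A sketch on toy data (not the poster's timing setup):

```python
import pandas as pd

# Placeholder frames; the real ones have tens of millions of rows.
df1 = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
                    'code': ['A', 'B', 'A']})
df2 = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'],
                    'code': ['A', 'A']})

# indicator=True adds a '_merge' column valued 'both' or 'left_only',
# so rows present only in df1 can be selected directly.
merged = df1.merge(df2, how='left', on=['date', 'code'], indicator=True)
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)  # the ('2021-01-01', 'B') row
```

`how='left'` suffices here because only rows missing from df2 are wanted; the outer merge in the timed snippet additionally surfaces rows present only in df2.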
So df1 has ~5,000 rows and df2 has ~5,000 rows? Have you tried .merge()? Could you add example input/output DataFrames?