Python: how to quickly compare very large DataFrames


I have two DataFrames with tens of millions of rows each, df1 and df2, where df1 is the primary one. Both have two columns: date and code. After deduplication there are about 5,000 distinct date values, and date+code pairs do not repeat within a frame. What is the fastest way to find the date+code pairs that exist in df1 but not in df2?

My current approach is a for loop over the dates, comparing each date's rows with np.setdiff1d. It processes only about 3 dates per second. How can I optimize it?

My code is as follows:

import pandas as pd
import numpy as np
from tqdm import tqdm

......

df1['s'] = df1.date + df1.code
df2['s'] = df2.date + df2.code

df = pd.DataFrame()

dates = df1.date.drop_duplicates().sort_values(ascending=False)

for date in tqdm(dates, total=len(dates)):
    t1 = df1[df1.date == date]
    t2 = df2[df2.date == date]  # bug in the original: df2 was filtered with df1's mask
    s = np.setdiff1d(t1.s, t2.s)
    if len(s) == 0:
        continue
    t = pd.DataFrame(s)
    df = pd.concat([df, t])  # concat returns a new frame; the result must be kept


print(f'len of df1:{len(df1)}')
print(f'len of df2:{len(df2)}')

print(f'columns of df1:{df1.columns}')
print(f'columns of df2:{df2.columns}')

print(f'df1 len of codes drop_duplicates:{len(df1.code.drop_duplicates())}')
print(f'df2 len of codes drop_duplicates:{len(df2.code.drop_duplicates())}')

print(f'df1 len of dates drop_duplicates:{len(df1.date.drop_duplicates())}')
print(f'df2 len of dates drop_duplicates:{len(df2.date.drop_duplicates())}')

print(df1.head(10))
print(df2.head(10))
Using np.setdiff1d directly on the whole columns might be faster, but the estimated total time is still too long:

import pandas as pd
import numpy as np
from tqdm import tqdm

......

df1['s'] = df1.date + df1.code
df2['s'] = df2.date + df2.code
s = np.setdiff1d(df1.s, df2.s)
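A hash-based membership test is another option worth considering here: Series.isin uses a hash table, so it avoids both the per-date Python loop and the sort inside np.setdiff1d. This is only a minimal sketch on made-up toy data (the real frames in the question have tens of millions of rows):

```python
import pandas as pd

# Toy stand-ins for the real multi-million-row frames
df1 = pd.DataFrame({'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
                    'code': ['A', 'B', 'A']})
df2 = pd.DataFrame({'date': ['2023-01-01', '2023-01-02'],
                    'code': ['A', 'A']})

# Build the same date+code key as in the question, then keep the rows of df1
# whose key never appears in df2
key1 = df1.date + df1.code
key2 = df2.date + df2.code
missing = df1[~key1.isin(key2)]

print(missing)
```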
Thanks to @Andrej Kesely, pd.merge turns out to be very fast:

from time import time

import pandas as pd

# Marker columns to tell which side each merged row came from
df1['a'] = 1
df2['b'] = 1

print('start merge')
start = time()
df = pd.merge(df1, df2, how='outer', on=['date', 'code'])
stop = time()
print(f'merge time:{stop - start} s')

df = df[df.a.isnull() | df.b.isnull()]  # rows present in only one of the two frames

print('start saving df')
start = time()
df.to_csv('df.csv', index=False)
stop = time()
print(f'saving time: {stop - start} s')
The result is:

start merge
merge time: 11.73848009109497 s
start saving df
saving time: 0.3139007091522217 s
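The same anti-join can be written without the helper a/b marker columns by using merge's indicator parameter, which adds a _merge column tagging each row as left_only, right_only, or both. A sketch on made-up toy data:

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2023-01-01', '2023-01-02'],
                    'code': ['A', 'B']})
df2 = pd.DataFrame({'date': ['2023-01-01'],
                    'code': ['A']})

# indicator=True records each row's origin in a '_merge' column
merged = pd.merge(df1, df2, how='outer', on=['date', 'code'], indicator=True)

# Keep rows that appear in only one of the two frames
diff = merged[merged._merge != 'both']
print(diff)
```

Filtering on `_merge == 'left_only'` instead would give exactly the pairs that are in df1 but not df2, which is what the question asks for.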

So df1 has ~5,000 rows and df2 has ~5,000 rows? Have you tried .merge()? Could you add example input/output DataFrames?