Python 3.x: a faster way to compare two lists element-wise

I am building a relational database using Python. So far I have two tables, shown below:

>>> df_Patient.columns
[1] Index(['NgrNr', 'FamilieNr', 'DosNr', 'Geslacht', 'FamilieNaam', 'VoorNaam',
       'GeboorteDatum', 'PreBirth'],
      dtype='object')


>>> df_LaboRequest.columns
[2] Index(['RequestId', 'IsComplete', 'NgrNr', 'Type', 'RequestDate', 'IntakeDate',
       'ReqMgtUnit'],
      dtype='object')

Both tables are fairly large:

>>> df_Patient.shape
[3] (386249, 8)

>>> df_LaboRequest.shape
[4] (342225, 7)
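
df_LaboRequest has a foreign key (FK) on the column NgrNr, which references the column of the same name in df_Patient. To avoid any integrity errors, I need to make sure that every value in df_LaboRequest['NgrNr'] is also present in df_Patient['NgrNr'].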

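Using a list comprehension, I tried to select the values that would raise an error, deduplicating each column with set() and casting the result back with list() before the membership test.

This takes a very long time to complete, though. Can anyone recommend a faster method for this comparison? ("Method" here is meant in the general sense, as a synonym for procedure, not in the Python sense of the word.)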

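I don't have pandas installed at the moment, so I can't try this myself, but you could drop the list(..) cast. I don't think it contributes anything to the program, and lookups in a set are much faster than in a list, e.g. x in set(…).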
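As a minimal sketch of that suggestion, the snippet below drops the list(..) cast and tests membership against a set that is built once. The names patient_ids, request_ids, known and missing, and the toy values, are made up for illustration; they stand in for the two NgrNr columns.

```python
# Toy stand-ins for the two NgrNr columns (hypothetical values).
patient_ids = [101, 102, 103, 104]
request_ids = [102, 104, 105, 105, 999]

# Build the lookup set once; `in` on a set is a hash lookup on average,
# whereas `in` on a list scans the elements one by one.
known = set(patient_ids)

# Deduplicate the requests, then keep the ones with no matching patient.
missing = sorted(x for x in set(request_ids) if x not in known)
print(missing)  # [105, 999]
```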

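You could also try doing this with the pandas API instead of lists and sets; it is sometimes faster. Try searching for unique. You can then compare the sizes of the two columns and, if they match, sort them and check for equality.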
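A size-and-equality comparison only shows whether the two key sets are identical, not which keys are missing. One possible way to get the missing keys while staying close to the unique idea is to combine Series.unique() with numpy.setdiff1d, which is a swapped-in technique and not something the answer itself proposes. The frames demo_patient and demo_request below are hypothetical miniatures of the real tables.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the two tables.
demo_patient = pd.DataFrame({"NgrNr": [1, 2, 3, 4]})
demo_request = pd.DataFrame({"NgrNr": [2, 2, 4, 7, 9, 9]})

# Distinct key values in each column, as NumPy arrays.
patient_keys = demo_patient["NgrNr"].unique()
request_keys = demo_request["NgrNr"].unique()

# Sorted, unique values present in request_keys but absent from patient_keys.
orphans = np.setdiff1d(request_keys, patient_keys)
print(orphans.tolist())  # [7, 9]
```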

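  • One-liners aren't always a good thing.

  • Don't check for membership in a list. Why on earth would you create a
    set (the recommended data structure for O(1) membership checks) and then
    cast it to a list, where each membership check is O(N)?

  • Build the set for df_Patient once, outside the list comprehension, and
    use that set, instead of creating the set again on every iteration.

  • Or, if you prefer set operations, simply take the set difference of the
    two sets (lab_requests - patients).

  • Or, use the pandas
    isin()
    function.

Let's test whether these changes make the code faster: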

    import pandas as pd
    import random
    import timeit
    
    # Make dummy dataframes of patients and lab_requests
    randoms = [random.randint(1, 1000) for _ in range(10000)]
    
    patients = pd.DataFrame("patient{0}".format(x) for x in randoms[:5000])[0]
    lab_requests = pd.DataFrame("patient{0}".format(x) for x in randoms[2000:8000])[0]
    
    # Do it your way
    def fun1(pat, lr): 
        return [x for x in list(set(lr)) if x not in list(set(pat))]
    
    # Do it my way: Set operations
    def fun2(pat, lr):
        pat_s = set(pat)
        lr_s = set(lr)
        return lr_s - pat_s
    
    # Or explicitly iterate over the set
    def fun3(pat, lr):
        pat_s = set(pat)
        lr_s = set(lr)
        return [x for x in lr_s if x not in pat_s]
    
    # Or using pandas
    def fun4(pat, lr):
        pat = pat.drop_duplicates()
        lr = lr.drop_duplicates()
        return lr[~lr.isin(pat)]
    
    # Make sure all 4 functions return the same thing
    assert set(fun1(patients, lab_requests)) == set(fun2(patients, lab_requests)) == set(fun3(patients, lab_requests)) == set(fun4(patients, lab_requests))
    
    # Time it
    timeit.timeit('fun1(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun1', number=100)
    # Output: 48.36615000000165
    
    timeit.timeit('fun2(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun2', number=100)
    # Output: 0.10799920000044949
    
    timeit.timeit('fun3(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun3', number=100)
    # Output: 0.11038020000069082
    
    timeit.timeit('fun4(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun4', number=100)
    # Output: 0.32021789999998873
    
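It looks like we get a speedup of roughly 150x with pandas and roughly 450x with the set-based versions (48.4 s versus 0.32 s and 0.11 s for 100 runs).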

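Applied to the actual tables, the three suggestions look like this:

    # Option 1: build each set once, then test membership against the set
    patients = set(df_Patient['NgrNr'])
    lab_requests = set(df_LaboRequest['NgrNr'])
    result = [x for x in lab_requests if x not in patients]
    
    # Option 2: set difference (reuses the two sets from option 1)
    result = lab_requests - patients
    
    # Option 3: pandas isin(), applied to the columns themselves (not the sets)
    patients = df_Patient['NgrNr'].drop_duplicates()
    lab_requests = df_LaboRequest['NgrNr'].drop_duplicates()
    result = lab_requests[~lab_requests.isin(patients)]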
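
If the offending rows are needed rather than just the key values, the isin() approach can be applied to the whole request table. This is only a sketch under assumptions: demo_patient and demo_request are hypothetical three- and four-row stand-ins for df_Patient and df_LaboRequest, and the values in them are made up.

```python
import pandas as pd

# Hypothetical miniature tables using column names from the question.
demo_patient = pd.DataFrame({"NgrNr": [1, 2, 3], "VoorNaam": ["An", "Bo", "Cy"]})
demo_request = pd.DataFrame({"RequestId": [10, 11, 12, 13], "NgrNr": [1, 5, 3, 5]})

# Boolean mask: True where the request's NgrNr has a matching patient.
has_patient = demo_request["NgrNr"].isin(demo_patient["NgrNr"])

# Rows that would violate the foreign key.
orphan_rows = demo_request[~has_patient]
print(orphan_rows["RequestId"].tolist())  # [11, 13]
```

Keeping the full rows makes it easier to decide whether to drop the requests or to add the missing patients before loading the data.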
    