What is the fastest way to compare the values of one list within a huge two-dimensional list in Python?


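I want to check whether the values of one list exist among the values of another list. The lists are huge (50k+ items from a database).

EDIT: I also want to mark the duplicated records as duplicate=True and keep them in a table for later reference.

Here is what the lists look like: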

n_emails = [[db_id, checksum] for db_id, checksum in search_results]
# I want to check whether a checksum exists in the same list or in another list, and retrieve the id (db_id) if it does
# example: n_emails = [[1, 'CAFEBABE010'], [2, 'bfeafe3df1ds'], [3, 'deadbeef101'], [5, 'CAFEBABE010']]
# in this case I want to retrieve ids 1 and 5 because they have the same checksum

def _getdups(old_lst, em_md5, em_id):
    # Linear scan over the whole list; each row is [id, checksum],
    # so the checksum is old[1] and the id is old[0]
    dups = []
    for old in old_lst:
        if em_md5 == old[1] and old[0] != em_id:
            dups.append(dict(org_id=old[0], md5hash=old[1], dupID=em_id))
    return dups

for m in n_emails:
    dups = _getdups(n_emails, m[1], m[0])
    n_dups = [casesdb.duplicates.insert(**dup) for dup in dups]
    if n_dups:
        print("Dupe Found")
        casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)

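But this takes far too long. On bigger lists (50k vs. 50k+ records) it ran for more than 5000 seconds and never finished; it looks like a never-ending loop. The server I run it on has 4 GB of RAM and 4 cores. Obviously I am doing something wrong.

Please help. Thanks a lot!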

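SOLVED: Dict index mapping is way faster! (That is with an unindexed MySQL table; note that I have not tested against an indexed table.)

It is 20 seconds vs. 30 milliseconds: 20*1000/30 = about 666 times faster! LOL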

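You would be better off finding the duplicates with SQL. For an example, see …

Pulling all those results into Python and processing them there is never going to be very fast, but if you have to, your best bet is to build a dictionary that maps checksums to IDs: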

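got_checksums = {}  # checksum -> first id seen with that checksum
for email_id, checksum in emails:
    if checksum in got_checksums:
        # Seen before: report this id together with the original id
        print(email_id, got_checksums[checksum])
    else:
        got_checksums[checksum] = email_id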
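
The SQL route mentioned at the start of this answer can be sketched with a GROUP BY ... HAVING query. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database rather than the asker's MySQL setup, and the table name email_data and column name checksum are assumptions.

```python
import sqlite3

# Hypothetical in-memory table standing in for the real email table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_data (id INTEGER PRIMARY KEY, checksum TEXT)")
conn.executemany(
    "INSERT INTO email_data (id, checksum) VALUES (?, ?)",
    [(1, "CAFEBABE010"), (2, "bfeafe3df1ds"), (3, "deadbeef101"), (5, "CAFEBABE010")],
)

# Every checksum that occurs more than once, with the ids that share it.
duplicate_groups = conn.execute(
    """
    SELECT checksum, GROUP_CONCAT(id) AS ids, COUNT(*) AS n
    FROM email_data
    GROUP BY checksum
    HAVING COUNT(*) > 1
    """
).fetchall()

for checksum, ids, n in duplicate_groups:
    print(checksum, ids, n)
```

With an index on the checksum column the database can usually do this grouping without scanning the table once per row, which is the point of pushing the work into SQL.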


The fastest way is to use a dict like this:

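n_emails = [[1, 'CAFEBABE010'], [2, 'bfeafe3df1ds'], [3, 'deadbeef101'], [5, 'CAFEBABE010']]

d = {}  # checksum -> list of ids that share it
for email_id, checksum in n_emails:
    if checksum not in d:
        d[checksum] = [email_id]
    else:
        d[checksum].append(email_id)

# Iterate over (key, value) pairs; a plain "for ... in d" would yield keys only
for checksum, ids in d.items():
    if len(ids) > 1:
        print(checksum, ids)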

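This is pretty much the algorithm for a hash join.

Here is a combined SQL/Python approach: I take the duplicate column and use it to store which message this one is considered to be a duplicate of.

The emails table should be at least: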
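
Since the question also asks about comparing against a second list, the hash join idea can be sketched for two lists: build a dict from one list, then probe it with the other. The function name hash_join and the sample lists are made up for illustration.

```python
def hash_join(build_rows, probe_rows):
    """Return (probe_id, build_id, checksum) for every checksum found in both lists."""
    # Build phase: index one list by checksum.
    index = {}
    for build_id, checksum in build_rows:
        index.setdefault(checksum, []).append(build_id)

    # Probe phase: one dict lookup per row of the other list.
    matches = []
    for probe_id, checksum in probe_rows:
        for build_id in index.get(checksum, []):
            matches.append((probe_id, build_id, checksum))
    return matches


old_emails = [[1, "CAFEBABE010"], [2, "bfeafe3df1ds"], [3, "deadbeef101"]]
new_emails = [[5, "CAFEBABE010"], [6, "0123456789a"]]

matches = hash_join(old_emails, new_emails)
print(matches)  # [(5, 1, 'CAFEBABE010')]
```

Each list is walked once, so the work grows with the combined length of the lists plus the number of matches, not with their product.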

create table emails (id, hash, duplicate default null)
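
One way the duplicate column might be filled in is sketched below. It is not the answer's original code: it assumes SQLite via Python's sqlite3 module, adds column types to the table definition, and treats the lowest id with a given hash as the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, hash TEXT, duplicate INTEGER DEFAULT NULL)"
)
conn.executemany(
    "INSERT INTO emails (id, hash) VALUES (?, ?)",
    [(1, "CAFEBABE010"), (2, "bfeafe3df1ds"), (3, "deadbeef101"), (5, "CAFEBABE010")],
)

first_seen = {}  # hash -> id of the first row seen with that hash
updates = []     # (original_id, duplicate_id) pairs to write back
for email_id, email_hash in conn.execute("SELECT id, hash FROM emails ORDER BY id"):
    if email_hash in first_seen:
        updates.append((first_seen[email_hash], email_id))
    else:
        first_seen[email_hash] = email_id

# Store, for each duplicate row, the id of the row it duplicates.
conn.executemany("UPDATE emails SET duplicate = ? WHERE id = ?", updates)
conn.commit()

rows = conn.execute("SELECT id, duplicate FROM emails ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, None), (3, None), (5, 1)]
```

Collecting the updates first and writing them in one executemany call keeps the number of round trips to the database small.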


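What you are doing wrong is:

You can probably get the result directly from the database. It is much faster than Python.
You are doing a linear search for the checksums, which means each of the 50k entries gets compared against each of the other 50k entries. That is a huge number of comparisons.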
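
To put a number on the second point, here is the arithmetic as a tiny sketch. It assumes exactly 50,000 rows and counts only checksum comparisons, ignoring the database inserts.

```python
n = 50_000

# Nested scan: every row is checked against every row.
nested_comparisons = n * n

# Dict index: one lookup (plus at most one insert) per row.
dict_lookups = n

print(nested_comparisons)                  # 2500000000
print(nested_comparisons // dict_lookups)  # 50000
```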

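What you should do is build an index over the checksums: make a dict that maps checksum -> entry. When you insert an entry, check whether its checksum already exists; if it does, the entry is a duplicate.

Or you simply use your database; databases love indexing.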
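
The check-on-insert idea can be wrapped in a small helper, sketched here. The class name ChecksumIndex and its add method are hypothetical, not part of any library.

```python
class ChecksumIndex:
    """Hypothetical helper: maps each checksum to the first entry id seen with it."""

    def __init__(self):
        self._first_id = {}

    def add(self, entry_id, checksum):
        """Record the entry; return the original id if it is a duplicate, else None."""
        original_id = self._first_id.get(checksum)
        if original_id is None:
            self._first_id[checksum] = entry_id
        return original_id


index = ChecksumIndex()
n_emails = [[1, "CAFEBABE010"], [2, "bfeafe3df1ds"], [3, "deadbeef101"], [5, "CAFEBABE010"]]
results = [index.add(entry_id, checksum) for entry_id, checksum in n_emails]
print(results)  # [None, None, None, 1]
```

Each add call is a single dict lookup, so checking an entry costs about the same no matter how many entries were added before it.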


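Finally, thanks for all the answers. I found dict mapping to be insanely fast! Much faster than SQL queries.

Here is my SQL query test (it looks awkward, but that is the syntax of web2py DAL queries).

I tested both approaches on 3,500 records, and only the dict mapping on 250,000 records.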

import datetime

print("de_duping started at %s" % datetime.datetime.now())

dupe_n = 0
for em_hash in n_emails:
    # One database query per email: select every row with the same MD5 hash
    dup_ids = casesdb(casesdb.email_data.MD5Hash == em_hash[1]).select(casesdb.email_data.id)
    if len(dup_ids) > 1:  # more than one row shares this hash
        dupe_n += 1

print("Email Dupes %s" % dupe_n)
print("Local de_duping ended at %s" % datetime.datetime.now())

The result:

de_duping started at 2010-12-02 03:39:24.610888
Email Dupes 3067
Local de_duping ended at 2010-12-02 03:39:52.669849

That is about 28 seconds. Here is the dict-based index mapping, based on Dan D's answer:

print("de_duping started at %s" % datetime.datetime.now())

dedupe_emails = {}  # checksum -> list of ids seen with that checksum
dupe_n = 0
for email_id, checksum in n_emails:
    if checksum not in dedupe_emails:
        dedupe_emails[checksum] = [email_id]
    else:
        # Seen before: record the id and flag the row as a duplicate
        dedupe_emails[checksum].append(email_id)
        dupe_n += 1
        casesdb(casesdb.email_data.id == email_id).update(duplicated=True)

print("Email Dupes %s" % dupe_n)
print("Local de_duping ended at %s" % datetime.datetime.now())

The result:

de_duping started at 2010-12-02 03:41:21.505235
Email Dupes 2591 # this is accurate as selecting from database regards first match as duplicate too
Local de_duping ended at 2010-12-02 03:41:21.531899

That took only about 30 ms!

Let's see what it does when de-duplicating 250,000 records:

de_duping at 2010-12-02 03:44:20.120880
Email Dupes 93567 
Local de_duping ended at 2010-12-02 03:45:12.612449

Less than a minute.
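
For anyone who wants to repeat this kind of comparison without a database, a small timing harness might look like the sketch below. The data is synthetic (random checksums generated with a fixed seed) and the function names are made up, so the timings will not match the figures above.

```python
import random
import time


def dupes_nested(rows):
    """Quadratic scan: ids whose checksum also appears on some other row."""
    return {
        row_id
        for row_id, checksum in rows
        for other_id, other_checksum in rows
        if checksum == other_checksum and other_id != row_id
    }


def dupes_dict(rows):
    """Dict index: group ids by checksum, keep groups with more than one id."""
    groups = {}
    for row_id, checksum in rows:
        groups.setdefault(checksum, []).append(row_id)
    return {row_id for ids in groups.values() if len(ids) > 1 for row_id in ids}


random.seed(0)
# Synthetic data: 2,000 rows drawn from 1,500 possible checksums.
rows = [(i, "%08x" % random.randrange(1500)) for i in range(2000)]

start = time.perf_counter()
slow = dupes_nested(rows)
middle = time.perf_counter()
fast = dupes_dict(rows)
end = time.perf_counter()

assert slow == fast  # both approaches must agree on the duplicate ids
print("nested: %.4f s, dict: %.4f s" % (middle - start, end - middle))
```

Even at 2,000 rows the nested version does four million comparisons, so the gap widens quickly as the row count grows.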

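Thanks for all the answers. I would like to accept everyone who pointed me in the right direction, but Dan D gave me the most detailed answer! Thanks, Dan.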
