Python: comparing the list of values associated with a key in a dictionary against the values associated with all other keys
Below is the code I wrote to compare the list of values associated with each key against the values of every other key in the dictionary. For about 10,000 records in the CSV file it takes a very long time. Can anyone help optimize the code so that it runs in the shortest possible time? Don't worry about the external function call; it works fine.
    import csv
    import sys

    file = sys.argv[1]
    result = {}
    temp = []

    # Creating Dict: GROUP_KEY -> list of [FIRST_NAME, LAST_NAME, ID]
    with open(file, 'rU') as inf:
        csvreader = csv.DictReader(inf, delimiter=',')
        for r in csvreader:
            name = []
            name.append(r['FIRST_NAME'])
            name.append(r['LAST_NAME'])
            name.append(r['ID'])
            result.setdefault(r['GROUP_KEY'], []).append(name)

    # Processing the Dict
    for key1 in result.keys():
        temp.append(key1)
        for key2 in result.keys():
            # The original tested "key2 not in ex", but ex is never defined;
            # temp (the keys already processed) appears to be what was meant.
            if key1 != key2 and key2 not in temp:
                for v1 in result[key1]:
                    for v2 in result[key2]:
                        score = name_match_score(v1, '', v2, '')[0]  # calling external function
                        if score > 0.90:
                            print v1[2], v2[2], score
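Something like this should help. The goal is to reduce the number of raw calculations performed in name_match_score by skipping redundant calculations and caching the ones that have already been done.

First, make your dictionary a defaultdict that stores lists of tuples. Tuples are immutable, so they can be used as keys in the sets and dicts below.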
    from collections import defaultdict
    import csv
    import sys

    file = sys.argv[1]
    with open(file, 'rU') as inf:
        csvreader = csv.DictReader(inf, delimiter=',')
        result = defaultdict(list)
        for r in csvreader:
            name = (r['FIRST_NAME'], r['LAST_NAME'], r['ID'])
            result[r['GROUP_KEY']].append(name)
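The hashability point can be checked in isolation. The following is a minimal sketch, written for Python 3, that uses made-up rows in place of the CSV file (the sample values are assumptions, not data from the question). It shows that a pair of tuples can be stored in a set, while a pair of lists raises a TypeError (unhashable type: 'list').

```python
from collections import defaultdict

# Hypothetical rows standing in for what csv.DictReader would yield.
rows = [
    {'GROUP_KEY': 'g1', 'FIRST_NAME': 'Ann', 'LAST_NAME': 'Lee', 'ID': '1'},
    {'GROUP_KEY': 'g1', 'FIRST_NAME': 'Anne', 'LAST_NAME': 'Lee', 'ID': '2'},
    {'GROUP_KEY': 'g2', 'FIRST_NAME': 'Bob', 'LAST_NAME': 'Ray', 'ID': '3'},
]

result = defaultdict(list)
for r in rows:
    name = (r['FIRST_NAME'], r['LAST_NAME'], r['ID'])
    result[r['GROUP_KEY']].append(name)

# Tuples are hashable, so a pair of them can go into a set or act as a dict key.
seen = set()
seen.add((result['g1'][0], result['g2'][0]))

# Lists are not hashable, so the same operation with lists raises TypeError.
try:
    seen.add((['Ann', 'Lee', '1'], ['Bob', 'Ray', '3']))
    list_pair_hashable = True
except TypeError:
    list_pair_hashable = False
```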
Then sort the keys, to make sure each pair of keys is evaluated only once:
    keys = sorted(result)
    for i, key1 in enumerate(keys):
        for key2 in keys[i+1:]:
And put v1 and v2 in a fixed order so that together they form a unique key. This helps with caching:
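            for v1 in result[key1]:
                for v2 in result[key2]:
                    # Order the pair under new names; rebinding v1 here would
                    # change the value used by later iterations of the inner loop.
                    a, b = min(v1, v2), max(v1, v2)
                    score = name_match_score(a, b)[0]  # calling external function
                    if score > 0.90:
                        print a[2], b[2], score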
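The same once-per-pair iteration can also be written with itertools.combinations from the standard library. The sketch below is written for Python 3 and uses a small made-up dictionary and a stand-in scoring function (both are assumptions; the real name_match_score is external to the question). It counts how many scoring calls are made and compares that with a plain double loop that only checks key1 != key2.

```python
from itertools import combinations, product

# Hypothetical grouped data: group key -> list of (first, last, id) tuples.
result = {
    'g1': [('Ann', 'Lee', '1')],
    'g2': [('Anne', 'Lee', '2'), ('Bob', 'Ray', '3')],
    'g3': [('Ann', 'Leigh', '4')],
}

calls = []

def fake_score(a, b):
    # Stand-in for the external name_match_score; it only records the call.
    calls.append((a, b))
    return 0.0

# combinations() yields each unordered pair of keys exactly once.
for key1, key2 in combinations(sorted(result), 2):
    for v1, v2 in product(result[key1], result[key2]):
        # Order the pair without rebinding the loop variables.
        a, b = min(v1, v2), max(v1, v2)
        fake_score(a, b)

unique_pair_calls = len(calls)

# A double loop that only checks key1 != key2 visits every ordered pair of keys.
ordered_pair_calls = sum(
    len(result[k1]) * len(result[k2])
    for k1 in result for k2 in result if k1 != k2
)
```

For n groups this halves the work, n(n-1)/2 key pairs instead of n(n-1), but the number of comparisons still grows quadratically with the number of groups.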
Then use a memoizing decorator to cache the calculations:
    import functools

    class memoized(object):
        '''Decorator. Caches a function's return value each time it is called.
        If called later with the same arguments, the cached value is returned
        (not reevaluated).
        '''
        def __init__(self, func):
            self.func = func
            self.cache = {}

        def __call__(self, *args):
            try:
                return self.cache[args]
            except KeyError:
                value = self.func(*args)
                self.cache[args] = value
                return value
            except TypeError:
                # Uncacheable: an argument is unhashable (a list, for instance).
                # Better to not cache than blow up.
                return self.func(*args)

        def __repr__(self):
            '''Return the function's docstring.'''
            return self.func.__doc__

        def __get__(self, obj, objtype):
            '''Support instance methods.'''
            return functools.partial(self.__call__, obj)
And change name_match_score to use the decorator:
    @memoized
    def name_match_score(v1, v2):
        # Whatever this does
        return (0.75, )
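On Python 3, functools.lru_cache from the standard library provides the same kind of memoization without a hand-written class. Here is a minimal sketch, with a placeholder scoring function standing in for the real name_match_score (its body and the call counter are assumptions made for illustration):

```python
from functools import lru_cache

raw_calls = 0

@lru_cache(maxsize=None)
def name_match_score(v1, v2):
    # Placeholder body: the real scoring logic is external to this example.
    global raw_calls
    raw_calls += 1
    return (1.0 if v1[:2] == v2[:2] else 0.0,)

a = ('Ann', 'Lee', '1')
b = ('Bob', 'Ray', '2')

# Ordering the arguments first means (a, b) and (b, a) share one cache entry.
first = name_match_score(min(a, b), max(a, b))
second = name_match_score(min(b, a), max(b, a))

info = name_match_score.cache_info()
```

A cache only pays off when the same pair of arguments recurs. Because the tuples here include the ID, repeats may be rare if the IDs are unique; caching on the name fields alone is one option to consider in that case.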
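This will minimize the number of raw calculations you perform in name_match_score.

File "test.py", line 30: if vkey not in vseen: TypeError: unhashable type: 'list'

It didn't help... it takes more time, because it checks the set over and over again.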