Python - problem with dedupe: TypeError: unhashable type: 'numpy.ndarray'

I can't get dedupe to run. I'm trying to use this library to remove duplicates from a large list of addresses. Here is my code:
import collections
import csv
import logging
import optparse
import os
import re

from numpy import nan

import dedupe
from unidecode import unidecode

optp = optparse.OptionParser()
optp.add_option('-v', '--verbose', dest='verbose', action='count',
                help='Increase verbosity (specify multiple times for more)'
                )
(opts, args) = optp.parse_args()
log_level = logging.WARNING
if opts.verbose == 1:
    log_level = logging.INFO
elif opts.verbose >= 2:
    log_level = logging.DEBUG
logging.getLogger().setLevel(log_level)

input_file = 'H:/My Documents/Python Scripts/Dedupe/DupeTester.csv'
output_file = 'csv_example_output.csv'
settings_file = 'csv_example_learned_settings'
training_file = 'csv_example_training.json'

def preProcess(column):
    # Normalize a field: decode, transliterate to ASCII, collapse
    # whitespace, strip surrounding quotes, and lowercase.
    column = column.decode("utf8")
    column = unidecode(column)
    column = re.sub(' +', ' ', column)
    column = re.sub('\n', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()
    return column

def readData(filename):
    # Read the CSV into a dict keyed by the row id from the first
    # (unnamed) column, with cleaned field values.
    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
            row_id = int(row[''])
            data_d[row_id] = dict(clean_row)
    return data_d

print 'importing data ...'
data_d = readData(input_file)

if os.path.exists(settings_file):
    print 'reading from', settings_file
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    fields = [
        {"field" : "fulladdr", "type" : "Address"},
        {"field" : "zip", "type" : "ShortString"},
    ]
    deduper = dedupe.Dedupe(fields)
    deduper.sample(data_d, 200)

    if os.path.exists(training_file):
        print 'reading labeled examples from ', training_file
        with open(training_file, 'rb') as f:
            deduper.readTraining(f)

    print 'starting active labeling...'
    dedupe.consoleLabel(deduper)
    deduper.train()

    with open(training_file, 'w') as tf:
        deduper.writeTraining(tf)
    with open(settings_file, 'w') as sf:
        deduper.writeSettings(sf)

print 'blocking...'
threshold = deduper.threshold(data_d, recall_weight=2)

print 'clustering...'
clustered_dupes = deduper.match(data_d, threshold)
print '# duplicate sets', len(clustered_dupes)

cluster_membership = {}
cluster_id = 0
for (cluster_id, cluster) in enumerate(clustered_dupes):
    id_set, scores = cluster
    cluster_d = [data_d[c] for c in id_set]
    canonical_rep = dedupe.canonicalize(cluster_d)
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {
            "cluster id" : cluster_id,
            "canonical representation" : canonical_rep,
            "confidence": score
        }

singleton_id = cluster_id + 1

with open(output_file, 'w') as f_output:
    writer = csv.writer(f_output)
    with open(input_file) as f_input:
        reader = csv.reader(f_input)
        heading_row = reader.next()
        heading_row.insert(0, 'confidence_score')
        heading_row.insert(0, 'Cluster ID')
        canonical_keys = canonical_rep.keys()
        for key in canonical_keys:
            heading_row.append('canonical_' + key)
        writer.writerow(heading_row)

        for row in reader:
            row_id = int(row[0])
            if row_id in cluster_membership:
                cluster_id = cluster_membership[row_id]["cluster id"]
                canonical_rep = cluster_membership[row_id]["canonical representation"]
                row.insert(0, cluster_membership[row_id]['confidence'])
                row.insert(0, cluster_id)
                for key in canonical_keys:
                    row.append(canonical_rep[key].encode('utf8'))
            else:
                row.insert(0, None)
                row.insert(0, singleton_id)
                singleton_id += 1
                for key in canonical_keys:
                    row.append(None)
            writer.writerow(row)
Specifically, when I run it I get the following:
C:\Anaconda\lib\site-packages\dedupe\core.py:18: UserWarning: There may be duplicates in the sample
warnings.warn("There may be duplicates in the sample")
Traceback (most recent call last):
File "<ipython-input-1-33e46d604c5f>", line 1, in <module>
runfile('H:/My Documents/Python Scripts/Dedupe/dupetestscript.py', wdir='H:/My Documents/Python Scripts/Dedupe')
File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "H:/My Documents/Python Scripts/Dedupe/dupetestscript.py", line 67, in <module>
deduper.sample(data_d, 200)
File "C:\Anaconda\lib\site-packages\dedupe\api.py", line 924, in sample
random_sample_size))
TypeError: unhashable type: 'numpy.ndarray'
A numpy array is mutable, meaning it can be changed. Python speeds up dictionary access by using the hash of the key rather than the key itself, so only hashable objects such as numbers, strings, or tuples can be used as dictionary keys. From the Python glossary definition of hashable:

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

All of Python's immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().
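A minimal sketch of that failure mode (the array and values here are made up for illustration): using a numpy array as a dictionary key raises the same TypeError seen in the traceback, while an immutable tuple of the same values works fine.

import numpy as np

arr = np.array([1, 2, 3])

# A mutable numpy array is unhashable, so using it as a dict key fails
try:
    d = {arr: 'value'}
except TypeError as e:
    print e  # unhashable type: 'numpy.ndarray'

# An immutable tuple of the same values is hashable and works as a key
d = {tuple(arr): 'value'}
print d[(1, 2, 3)]  # value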
Are you sure your example code is complete? The error message is about a numpy array, but your code doesn't use one. The only thing it imports from numpy is nan, which is never used. I can't reproduce the error, and the code hasn't been reduced to the actual problem.

@strpeter Not surprising, since this was a bug in the library that has since been flagged and fixed. See here:

That's unfortunate. What if you're working in pandas and want to build a dict? Pandas is all numpy under the hood...

You can create dict keys from the immutable data inside a pandas DataFrame or numpy array. You can't use the array or DataFrame itself as a key. You can use it as a value, though.
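A short sketch of that last point (the DataFrame and column names are hypothetical, assuming pandas is installed): extract immutable scalars or tuples from the frame to serve as keys, and keep the array or frame itself only as a value.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [10, 11], 'addr': ['1 Main St', '2 Elm St']})

# Scalars and tuples pulled out of the frame are immutable,
# so they work as dict keys
by_id = {int(row['id']): row['addr'] for _, row in df.iterrows()}
print by_id[10]  # 1 Main St

row_key = tuple(df.iloc[0])  # a plain tuple of one row's values
lookup = {row_key: 'first row'}

# The mutable containers themselves are fine as *values*
holder = {'frame': df, 'ids': np.array(df['id'])}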