Python dedupe.io: problem reading data from SQL Server


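I am trying to pull a large dataset from SQL Server and deduplicate it with the Python dedupe library. I am using pyodbc as the database connector, but I can't figure out how to get the SQL Server data into the right format. The same approach works fine with MySQL, but without a dict-style row reader I can't work out how the data should be shaped. At the moment I get the following error: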

TypeError: row indices must be integers, not str

Here is the code that tries to build the data:

cur = con.cursor()

print("\n\nExecuting TOMIS Select")
cur.execute(TOMISSelect)
print("\nSelect Complete")
colHeader = [column[0] for column in cur.description]
temp_d = {0:tuple(colHeader)}
temp_data = {(i+1): row for i, row in enumerate(cur)}
temp_d.update(temp_data)

if os.path.exists(training_file):
    print("\nReading labeled examples from ", training_file)
    with open(training_file) as tf:
        deduper.prepare_training(temp_d, tf)
else:
    print("\nManual Training")
    deduper.prepare_training(temp_d)

Here is the output and the full traceback:

Manual Training
Traceback (most recent call last):

  File "C:\Users\01-workspace\02-dedupe\TOMISDeDupe\TomisFullDeDupe.py", line 134, in <module>
    deduper.prepare_training(temp_d)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 806, in prepare_training
    self.sample(data, sample_size, blocked_proportion, original_length)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 838, in sample
    index_include=examples)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 403, in __init__
    self.candidates = super().sample(data, blocked_proportion, sample_size)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 43, in sample
    data)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 22, in blockedSample
    *args))

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 62, in dedupeSamplePredicates
    items)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 73, in dedupeSamplePredicate
    column = record[field]

TypeError: row indices must be integers, not str


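I have tried several different ways of reading the data from SQL Server, with no luck. The MySQL query returns the data in the correct dictionary format, but I can't seem to get the SQL Server data into that format.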

I think you need to do something like this:

colHeader = tuple(column[0] for column in cur.description)
temp_d = {i: dict(zip(colHeader, row)) for i, row in enumerate(cur)}
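
To spell out why this helps: dedupe looks up each record by field name (`record[field]` in the traceback), so it needs a dict of the form `{record_id: {field: value}}`. A pyodbc row can be indexed by integer position but not by a string key, which is what raises the TypeError. Below is a minimal, self-contained sketch of the conversion. It uses the standard-library `sqlite3` module purely as a stand-in for pyodbc, since both expose `cursor.description` through the Python DB-API. The `people` table, its columns, and the helper name `rows_to_dedupe_dict` are made up for illustration.

```python
import sqlite3  # stand-in for pyodbc here; both follow the Python DB-API


def rows_to_dedupe_dict(cursor):
    """Turn the rows of an already-executed cursor into {record_id: {column: value}}."""
    # cursor.description holds one 7-item sequence per column; item 0 is the column name.
    columns = [col[0] for col in cursor.description]
    return {i: dict(zip(columns, row)) for i, row in enumerate(cursor)}


# Hypothetical table and data, only to make the example runnable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, city TEXT)")
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "Oslo"), ("Anne", "Oslo")],
)

cur = con.cursor()
cur.execute("SELECT name, city FROM people ORDER BY rowid")
temp_d = rows_to_dedupe_dict(cur)

# Each record can now be indexed by field name, as dedupe does internally.
print(temp_d[0]["name"])
```

Note that the header row no longer needs to be stored under key 0 as in the question's code, because the column names are carried inside every record dict.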
