Python: What is "pairs" when comparing the records of each record pair in recordlinkage?

Tags: python, python-3.x, duplicates, data-cleaning, record-linkage

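I have a dataset of real-estate listings. Several rows describe the same property, so it is full of duplicates that are not exactly identical. It looks like this: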

    ID  URL CRAWL_SOURCE    PROPERTY_TYPE   NEW_BUILD   DESCRIPTION IMAGES  SURFACE LAND_SURFACE    BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY    ZIP_CODE    DEPT_CODE   PUBLICATION_START_DATE  PUBLICATION_END_DATE    LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0   22c05930-0eb5-11e7-b53d-bbead8ba43fe    http://www.avendrealouer.fr/location/levallois...   A_VENDRE_A_LOUER    APARTMENT   False   Au rez de chaussée d'un bel immeuble récent,...   ["https://cf-medias.avendrealouer.fr/image/_87...   72.0    NaN NaN ... Lamirand Et Associes    AGENCY  54178039    Levallois-Perret    92300.0 92  2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1   8d092fa0-bb99-11e8-a7c9-852783b5a69d    https://www.bienici.com/annonce/ag440414-16547...   BIEN_ICI    APARTMENT   False   Je vous propose un appartement dans la rue Col...   ["http://photos.ubiflow.net/440414/165474561/p...   48.0    NaN NaN ... Proprietes Privees  MANDATARY   54178039    Levallois-Perret    92300.0 92  2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89  2018-09-25
...
I want to find the records in the dataset that belong to the same entity using recordlinkage. So I read the documentation and followed its example:

import recordlinkage

# Full index: every possible pair of rows becomes a candidate link
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(df)

print(len(df), len(candidate_links))
2164 2340366
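
The second number printed above is the count of unordered row pairs, n(n-1)/2, which is what a full index produces when a single DataFrame is compared with itself. The sketch below reproduces that count with the standard library only; it does not call recordlinkage, and the helper name `full_index_pairs` is made up for illustration. The order of labels inside each pair may differ from what recordlinkage returns.

```python
from itertools import combinations
from math import comb


def full_index_pairs(labels):
    """Return every unordered pair of row labels (illustrative helper)."""
    return list(combinations(labels, 2))


# Three hypothetical row labels yield three candidate pairs.
toy_pairs = full_index_pairs([0, 1, 2])
print(toy_pairs)

# For the 2164 rows in the question, the number of pairs is C(2164, 2).
n_rows = 2164
n_pairs = comb(n_rows, 2)
print(n_rows, n_pairs)
```

Each element of such a list is one "record pair": two row labels that point back into the DataFrame.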

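Every record pair is a candidate match. To classify the candidate pairs into matches and non-matches, I want to compare the records on all the attributes the two records have in common. The recordlinkage module has a class named Compare, which is used to compare records. The following code shows how I compare attributes: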

compare_cl = recordlinkage.Compare()

However, this gives me:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-51-1e55ea540dbd> in <module>
      9 #compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
     10 
---> 11 features = compare_cl.compute(pairs, df)

NameError: name 'pairs' is not defined

I can't figure out what pairs is supposed to be...

Try passing candidate_links in place of pairs. From the documentation:

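compute(pairs, x, x_link=None): Compare the records of each record pair.

Calling this method starts the comparing of records.

Parameters: pairs (pandas.MultiIndex) – A pandas MultiIndex with the record pairs to compare. The indices in the MultiIndex are indices of the DataFrame(s) to link.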
