Python: why do I get this KeyError when using multiprocessing?
I am trying to use multiprocessing on a fuzzy-matching script I wrote. I need to perform 1.4 billion comparisons, which takes over 30 hours without multiprocessing, so I tried to integrate it here:
def fuzzyCompare(data1, data2):
    print("Performing Fuzzy Matches...\n")
    similarityDf = pd.DataFrame(columns=["Similarity Ratio", "Id1", "Id2"])
    count = 0
    for i in range(len(data1)):
        str1 = data1["CompanyName"][i] + "," + data1["Address1"][i] + "," + data1["City"][i] + "," + data1["PostalZip"][i]
        str1 = str1.lower().replace(" ", "")
        for j in range(len(data2)):
            str2 = data2["Company"][j] + "," + data2["Physical Street 1"][j] + "," + data2["Physical City"][j] + "," + data2["Physical Postal Code/ZIP"][j]
            str2 = str2.lower().replace(" ", "")
            ratio = fuzz.ratio(str1, str2)
            if ratio > 0:
                similarityDf.at[count, "Similarity Ratio"] = str(ratio) + "%"
                similarityDf.at[count, "Id1"] = data1["Id1"][i]
                similarityDf.at[count, "Id2"] = data2["Id2"][j]
                count = count + 1
    print("Performed " + str(len(data1)*len(data2)) + " Fuzzy Comparisons.\n")
    return similarityDf
def main():
    data1 = readData(excelFile1)  # read Excel file into dataframe
    data2 = readData(excelFile2)  # read Excel file into dataframe
    df_split = np.array_split(data2, 4)  # split data2 into 4
    args = [(data1, df_split[0]),
            (data1, df_split[1]),
            (data1, df_split[2]),
            (data1, df_split[3])]
    with mp.Pool(processes=4) as p:
        outputData = pd.concat(p.starmap(fuzzyCompare, args))

if __name__ == "__main__":
    mp.freeze_support()
    main()
I have a print statement at the end of fuzzyCompare(). It prints the results from only one worker, and then I get the following error:
multiprocessing.pool.RemoteTraceback
Traceback (most recent call last):
File "C:\Users\...\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "C:\Users\...\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "C:\Users\...\Documents\Code\Python\fuzzyCompare\multiFuzzyCLI.py", line 47, in fuzzyCompare
str2 = data2["Company"][j] + "," + data2["Physical Street 1"][j] + "," + data2["Physical City"][j] + "," + data2["Physical Postal Code/ZIP"][j]
File "C:\Users\...\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\series.py", line 1068, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\...\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "multiFuzzyCLI.py", line 145, in <module>
main()
File "multiFuzzyCLI.py", line 132, in main
outputData = pd.concat(p.starmap(fuzzyCompare, args))
File "C:\Users\...\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 276, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "C:\Users\...\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
KeyError: 0
I know what a KeyError is; I just don't understand how it is getting one in this case. Thanks!

You get a KeyError because you index each dataframe with labels starting from 0, while np.array_split preserves each split's original index. To select the i-th row of a DataFrame positionally, you should always use DataFrame.iloc, since that works for any index, not only a RangeIndex starting at 0. So you need to change all of your selections to this form:
data2["Company"].iloc[j] # Not data2["Company"][j]
Working example

Thank you so much! This fixed everything and now it works perfectly :)
import pandas as pd
import numpy as np

df = pd.DataFrame({'CompanyName': list('abcdefghij')})
df_split = np.array_split(df, 4)

# The first split works only because we get lucky: its index starts at 0
data2 = df_split[0]
for j in range(len(data2)):
    print(data2['CompanyName'][j])
# a
# b
# c

# Later splits fail; `df_split[1].index` is RangeIndex(start=3, stop=6, step=1)
data2 = df_split[1]
for j in range(len(data2)):
    print(data2['CompanyName'][j])
# KeyError: 0

# Instead, properly select with `.iloc`
for j in range(len(data2)):
    print(data2['CompanyName'].iloc[j])
# d
# e
# f
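As an alternative sketch (not part of the answer above): instead of switching every lookup to `.iloc`, you can reset each split's index so that 0-based label lookups work again. This assumes the original index carries no information you need to keep.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CompanyName': list('abcdefghij')})

# reset_index(drop=True) gives each split a fresh 0..n-1 index,
# so `data2[col][j]` no longer raises KeyError for later splits
df_split = [part.reset_index(drop=True) for part in np.array_split(df, 4)]

data2 = df_split[1]
names = [data2['CompanyName'][j] for j in range(len(data2))]
print(names)  # ['d', 'e', 'f'] -- no KeyError
```

Either approach works; `.iloc` is the more general habit, while `reset_index` lets the original code run unchanged.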