Speeding up string matching in Python
I have two different dataframes for which I'm trying to match a string column (Name). Below is a sample of the DFs.
df1 (127000,3)
Code Name PostalCode
150 Maarc 47111
250 Kirc 41111
170 Moic 42111
140 Nirc 44111
550 Lacter 47111
df2 (38000,3)
Code NAME POSTAL_CODE
150 Marc 47111
250 Kikc 41111
170 Mosc 49111
140 NiKc 44111
550 Lacter 47111
The aim is to create a third dataframe, DF3, like this:
Code NAME Best Match Score
150 Marc Maarc 0.9
250 Karc Kirc 0.9
The code below gives the expected output:
import difflib
from functools import partial
f = partial(difflib.get_close_matches, possibilities= df1['Name'].tolist(), n=1)
matches = df2['NAME'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio()
          for x, y in zip(matches, df2['NAME'])]
df3 = df2.assign(best=matches, score=scores)
df3.sort_values(by='score')
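As a sanity check on the scoring column, SequenceMatcher can be run directly on one of the sample pairs above (Marc vs. Maarc). Its ratio is 2*M/T, where M is the number of matched characters and T the combined length of both strings:

```python
import difflib

# M = 4 matched characters ("M", "a", "r", "c"), T = 4 + 5 = 9
score = difflib.SequenceMatcher(None, 'Marc', 'Maarc').ratio()
print(round(score, 2))  # 0.89, which the table above rounds to 0.9
```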
Problem
Matching the strings of just two rows takes about 30 seconds. This task has to run on 1K rows, which would take hours.
Question
How can I speed the code up?
I was thinking of something like fetchall.
Edit
I also tried the fuzzywuzzy library, which took even longer than difflib, with the following code:
from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    # idxmax() returns an index label; positional .iloc works here only with a default RangeIndex
    idx = s.idxmax()
    return {'Name': df['Name'].iloc[idx], 'CODE': df['Code'].iloc[idx], 'Value': s.max()}

df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))
The fastest way I can think of to match strings is to use regular expressions. Regex is a search language designed to find matches in strings. You can see an example here:
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
# x is a truthy re.Match object when the pattern is found, otherwise None
*Taken from:
Since I don't know anything about dataframes, I don't know how to implement regex in your code, but I hope the regex functions can help you.

So I was able to speed up the matching step by using the postal-code column as a discriminator. Computation time went from 1h40 down to 7mn. Below is a sample of the DFs.
df1 (127000,3)
Code Name PostalCode
150 Maarc 47111
250 Kirc 41111
170 Moic 42111
140 Nirc 44111
550 Lacter 47111
df2 (38000,3)
Code NAME POSTAL_CODE
150 Marc 47111
250 Kikc 41111
170 Mosc 49111
140 NiKc 44111
550 Lacter 47111
Below is the code that matches on the Name column and retrieves the name with the best score:
%%time
import difflib
from functools import partial
import numpy as np

def difflib_match(df1, df2, set_nan=True):
    # Initialize result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan
    # Unique postal codes of df2
    first = df2['POSTAL_CODE'].unique()
    # Loop over each postal code and match only the rows of both DFs that share it
    for m, letter in enumerate(first):
        # Print progress every 100 unique values processed
        if m % 100 == 0:
            print(m, 'of', len(first))
        df1_first = df1[df1['PostalCode'] == letter]
        df2_first = df2[df2['POSTAL_CODE'] == letter]
        # Function to match using the Name column
        f = partial(difflib.get_close_matches, possibilities=df1_first['Name'].tolist(), n=1)
        matches = df2_first['NAME'].map(f).str[0].fillna('')
        # Retrieve the best score for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_first['NAME'])]
        # Write the results back into df2
        for i, name in enumerate(df2_first['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)
    return df2

# Apply function
df_diff = difflib_match(df1, df2)
# Display DF
print('Shape: ', df_diff.shape)
df_diff.head()
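The per-postal-code loop and the row-by-row write-back above could also be sketched with a pandas groupby, which indexes df1's names once instead of re-filtering both frames on every iteration. This is a sketch assuming the same column names as the samples above; the helper names are mine:

```python
import difflib
import pandas as pd

def match_by_postal(df1, df2):
    # Group df1's names by postal code once, up front
    names_by_code = df1.groupby('PostalCode')['Name'].apply(list)

    def best_match(row):
        # Only compare against names sharing the row's postal code
        candidates = names_by_code.get(row['POSTAL_CODE'], [])
        hits = difflib.get_close_matches(row['NAME'], candidates, n=1)
        best = hits[0] if hits else ''
        score = difflib.SequenceMatcher(None, best, row['NAME']).ratio()
        return pd.Series({'best': best, 'score': score})

    return df2.join(df2.apply(best_match, axis=1))
```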
Unfortunately, I don't think difflib is the right tool for this task; it just isn't that fast. Maybe you could try building a distance matrix or something similar with the sklearn module. The Levenshtein distance might be interesting for your case.

Why do you think a regex would be faster than literal matching? That is debatable, and usually depends on the complexity of the match and on how well the regex is written, as shown in the following link:
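For reference, sklearn itself does not ship a string edit distance, but the Levenshtein idea the comment mentions is simple enough to sketch in pure Python. The function name is mine; dedicated libraries implement the same thing much faster:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute) turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows sized by the shorter
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # Cost of deletion, insertion, and substitution respectively
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(levenshtein('Maarc', 'Marc'))  # 1 (one deletion)
```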