Python: speeding up string matching

I have two different dataframes, and I'm trying to match a string column (Name) between them.

Here are some samples of the DFs:

df1 (127000,3)
Code     Name     PostalCode
150      Maarc    47111
250      Kirc     41111
170      Moic     42111
140      Nirc     44111
550      Lacter   47111

df2 (38000,3)
Code     NAME     POSTAL_CODE
150      Marc     47111
250      Kikc     41111
170      Mosc     49111
140      NiKc     44111
550      Lacter   47111

The aim is to create a third DF3, like the one below:

Code     NAME    Best Match   Score
150      Marc    Maarc        0.9
250      Karc    Kirc         0.9
The code below gives the expected output:

import difflib
from functools import partial

# For each name in df2, find the single closest name in df1
f = partial(difflib.get_close_matches, possibilities=df1['Name'].tolist(), n=1)

matches = df2['NAME'].map(f).str[0].fillna('')

scores = [difflib.SequenceMatcher(None, x, y).ratio()
          for x, y in zip(matches, df2['NAME'])]

df3 = df2.assign(best=matches, score=scores)
df3 = df3.sort_values(by='score')
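To make the two difflib calls above concrete, here is a toy run on the sample names from df1 (pure standard library, no pandas needed):

```python
import difflib

# Candidate names from the sample df1
names_df1 = ['Maarc', 'Kirc', 'Moic', 'Nirc', 'Lacter']

# get_close_matches returns up to n candidates above the default 0.6 cutoff
best = difflib.get_close_matches('Marc', names_df1, n=1)
print(best)  # ['Maarc']

# SequenceMatcher.ratio() is the similarity score used for the 'score' column
score = difflib.SequenceMatcher(None, best[0], 'Marc').ratio()
print(round(score, 2))  # 0.89
```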
Problem

Matching the strings for just two rows already takes about 30 seconds; doing this for 1K rows would take hours.

Question

How can I speed up this code? I was thinking of something like fetchall.

Edit

I even tried the fuzzywuzzy library, which took longer than difflib, using the code below:

from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()  # idxmax returns an index label, so use .loc, not .iloc
    return {'Name': df['Name'].loc[idx], 'CODE': df['Code'].loc[idx], 'Value': s.max()}

df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))

The fastest way I can think of to match strings is to use regular expressions.

Regular expressions are a small search language designed for finding patterns in strings.

You can see an example here:

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# x is a re.Match object (truthy) when the pattern is found, otherwise None
*Taken from:


Since I don't know anything about your dataframes, I can't say how to implement the regex in your code, but I hope the regex functions can help you.

So, I was able to speed up the matching step by using the postal-code column as a discriminator; the run time went from 1h40 down to 7 minutes.

Here are some samples of the DFs:

df1 (127000,3)
Code     Name     PostalCode
150      Maarc    47111
250      Kirc     41111
170      Moic     42111
140      Nirc     44111
550      Lacter   47111

df2 (38000,3)
Code     NAME     POSTAL_CODE
150      Marc     47111
250      Kikc     41111
170      Mosc     49111
140      NiKc     44111
550      Lacter   47111

Here is the code that matches the Name column and retrieves the name with the best score:

%%time
import difflib
import numpy as np
from functools import partial

def difflib_match(df1, df2, set_nan=True):

    # Initialize the result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan

    # Unique postal codes of df2, used to block the comparisons
    codes = df2['POSTAL_CODE'].unique()

    # Loop over each postal code and only compare rows that share it in both DFs
    for m, code in enumerate(codes):

        # Print progress every 100 unique values processed
        if m % 100 == 0:
            print(m, 'of', len(codes))

        df1_block = df1[df1['PostalCode'] == code]
        df2_block = df2[df2['POSTAL_CODE'] == code]

        # For each name in the df2 block, find the closest name in the df1 block
        f = partial(difflib.get_close_matches, possibilities=df1_block['Name'].tolist(), n=1)

        matches = df2_block['NAME'].map(f).str[0].fillna('')

        # Retrieve the similarity score for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_block['NAME'])]

        # Write the results back to df2
        for i, name in enumerate(df2_block['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)

    return df2

# Apply the function
df_diff = difflib_match(df1, df2)

# Display the result
print('Shape: ', df_diff.shape)
df_diff.head()
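The same postal-code blocking can be written more compactly by grouping df1 once up front instead of re-filtering it in every iteration; a minimal sketch, assuming the column names from the question and difflib's default 0.6 cutoff (rows with no close candidate get an empty string and a score of 0):

```python
import difflib

import pandas as pd


def match_by_postal_code(df1, df2):
    # Group df1 names by postal code once, instead of filtering per iteration
    groups = {code: g['Name'].tolist() for code, g in df1.groupby('PostalCode')}

    best, scores = [], []
    for name, code in zip(df2['NAME'], df2['POSTAL_CODE']):
        # Only names sharing the postal code are considered as candidates
        candidates = groups.get(code, [])
        hit = difflib.get_close_matches(name, candidates, n=1)
        if hit:
            best.append(hit[0])
            scores.append(difflib.SequenceMatcher(None, hit[0], name).ratio())
        else:
            best.append('')
            scores.append(0.0)

    return df2.assign(best=best, score=scores)
```

Because each name is only compared against candidates sharing its postal code, the number of SequenceMatcher calls drops from 127000 × 38000 to the sum of the much smaller per-block products.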

Unfortunately, I don't think difflib is the right tool for this task; it just isn't that fast. Maybe you can try building a distance matrix with the sklearn module, or something similar. The Levenshtein distance might be interesting for your case. Why do you think a regex would be faster than literal matching? That's debatable, and it usually depends on the complexity of the match and on how well you write the regex, as the following link shows:
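As a follow-up to the Levenshtein suggestion in the comment above, here is a minimal pure-Python edit-distance sketch (dedicated C-backed libraries such as python-Levenshtein or rapidfuzz compute the same thing far faster):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein('Marc', 'Maarc'))  # 1: one inserted 'a'
```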