Improving string matching performance with fuzzywuzzy/rapidfuzz in Python
I have a large dictionary that stores many English sentences and their Spanish translations. My original code looks like this:
from fuzzywuzzy import process
sentencePairs = {'How are you?':'¿Cómo estás?', 'Good morning!':'¡Buenos días!'}
query= 'How old are you?'
match = process.extractOne(query, sentencePairs.keys())[0]
print(match, sentencePairs[match], sep='\n')
I then switched from fuzzywuzzy to rapidfuzz for better speed. I also tried multithreading, but surprisingly it did not help much. My new code is below:
from rapidfuzz import process, utils, fuzz
from concurrent.futures import ThreadPoolExecutor
import time, string, random
random.seed(18)
def findMatch(query, dictionary):
    match, score = process.extractOne(
        utils.default_process(query),
        dictionary.keys(),
        processor=None,
        scorer=fuzz.ratio)
    return (match, score)
# make a dictionary for testing
d = {
    ''.join(random.choice(string.ascii_lowercase + string.digits)
            for _ in range(15)
            ): "spanish text"
    for s in range(1000000)
}
d['how are you?'] = '¿Cómo estás?'
# split the dictionary in half for multithreading
d1 = dict(list(d.items())[:len(d)//2])
d2 = dict(list(d.items())[len(d)//2:])
query= 'How old are you?'
# ---with multithreading---
start_time1 = time.time()
print('Start matching with multithreading...')
with ThreadPoolExecutor() as executor:
    future = executor.submit(findMatch, query, d1)
    match1, score1 = future.result()
with ThreadPoolExecutor() as executor:
    future = executor.submit(findMatch, query, d2)
    match2, score2 = future.result()
if score1 >= score2 and score1 > 70:
    print(match1, d[match1], sep=' - ')
elif score2 > score1 and score2 > 70:
    print(match2, d[match2], sep=' - ')
else:
    print('No match found.')
print('Time spent with multithreading: {}\n'.format(time.time() - start_time1))
# ---without multithreading---
start_time2 = time.time()
print('Start matching without multithreading...')
match, score = findMatch(query, d)
if score > 70:
    print(match, d[match], sep=' - ')
print('Time spent without multithreading: {}'.format(time.time() - start_time2))
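I expected multithreading to cut the matching time substantially, but the opposite happened. Is there a way to reduce the matching time significantly? Or am I using multithreading incorrectly?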
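One thing worth checking in the multithreaded section above: each `future.result()` is called before the next task is submitted, so the two halves appear to run one after the other rather than at the same time. Below is a minimal sketch of submitting every chunk first and only then collecting the results. It uses `difflib` from the standard library as a stand-in scorer so that it runs without extra dependencies; `find_match` and `find_match_parallel` are hypothetical helper names, not part of rapidfuzz or fuzzywuzzy. Whether threads actually speed up a CPU-bound scorer depends on whether that scorer releases the GIL, which this sketch does not verify.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def find_match(query, choices):
    """Return (best_choice, score on a 0-100 scale) using difflib as a stand-in scorer."""
    best, best_score = None, -1.0
    for choice in choices:
        score = 100 * SequenceMatcher(None, query, choice).ratio()
        if score > best_score:
            best, best_score = choice, score
    return best, best_score


def find_match_parallel(query, chunks):
    """Submit all chunks before waiting on any result, so the tasks can overlap."""
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(find_match, query, chunk) for chunk in chunks]
        results = [future.result() for future in futures]
    # Keep the highest-scoring candidate across all chunks.
    return max(results, key=lambda pair: pair[1])


# Small illustrative data set (assumed for the example, not the 1,000,000-key test dictionary).
keys = ['how are you?', 'good morning!', 'see you later', 'thank you']
half = len(keys) // 2
match, score = find_match_parallel('how old are you?', [keys[:half], keys[half:]])

if score > 70:
    print(match, round(score, 1))
else:
    print('No match found.')
```

The same submission pattern should carry over to the `findMatch` function from the question by passing it to `executor.submit` in place of `find_match`. If the scorer holds the GIL for the whole call, a process pool may be the more relevant experiment, at the cost of copying the dictionary keys to each worker.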