Python: search algorithms for full-text matching and their performance

Suppose I have the following string:

/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt

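Given the following input terms, how can I determine the fastest way to search this string for matches?

['sicario', '419']

For example, the most basic versions would be:

1) String containment: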

s = '/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt'
terms = ['sicario', '419']
has_match = all([term.lower() in s.lower() for term in terms])

2) Regular expressions
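The regex variant is not spelled out above, so here is a minimal sketch of what it might look like, using the same `s` and `terms`. The use of `re.escape` is an addition here (it only matters if a term contains regex metacharacters), and the name `compiled_terms` is just illustrative.

```python
import re

s = '/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt'
terms = ['sicario', '419']

# Compile each term once; re.escape treats the term as a literal string,
# and re.IGNORECASE makes the match case insensitive.
compiled_terms = [re.compile(re.escape(term), re.IGNORECASE) for term in terms]

# Every term must be found somewhere in s.
has_match = all(pattern.search(s) for pattern in compiled_terms)
print(has_match)
```

Compiling outside the hot path keeps the per-search cost down to the `search` call itself.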

Other possible options are:

Rabin-Karp
Boyer-Moore
and others

What is the time complexity of each of these algorithms?
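As a point of reference for the complexity question, here is a rough pure-Python sketch of Rabin-Karp with a rolling hash. For a text of length n and a pattern of length m it runs in O(n + m) expected time and O(n*m) in the worst case (many hash collisions). The function name `rabin_karp_find` and the `base` and `mod` values are arbitrary choices for illustration, and a pure-Python loop like this should not be expected to beat the built-in `in` operator.

```python
def rabin_karp_find(text, pattern, base=256, mod=1_000_000_007):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    for i in range(n - m + 1):
        # Compare the actual characters only when the hashes agree,
        # which guards against hash collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the hash: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1


s = '/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt'
terms = ['sicario', '419']
lower_s = s.lower()
has_match = all(rabin_karp_find(lower_s, term) != -1 for term in terms)
print(has_match)
```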

Below are the timings I got when comparing regex against str containment:
import timeit

# Case insensitive
setup = 's="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; terms = ["sicario", "419"]'
print(min(timeit.Timer('all([term in s.lower() for term in terms])', setup=setup).repeat(7, 1000)))
# 0.00134181976318

# Case sensitive
setup = 's="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; terms = ["sicario", "419"]'
print(min(timeit.Timer('all([term in s for term in terms])', setup=setup).repeat(7, 1000)))
# 0.000231027603149

# Regex case insensitive
setup = 'import re; s="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; compiled_terms = [re.compile("sicario", re.I), re.compile("419", re.I)]'
print(min(timeit.Timer('all([compiled_term.search(s) for compiled_term in compiled_terms])', setup=setup).repeat(7, 1000)))
# 0.00111889839172

# Regex case sensitive
setup = 'import re; s="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; compiled_terms = [re.compile("sicario"), re.compile("419")]'
print(min(timeit.Timer('all([compiled_term.search(s) for compiled_term in compiled_terms])', setup=setup).repeat(7, 1000)))
# 0.000588893890381

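The results are fairly close, although the case-sensitive string search performs roughly 2x better than the case-sensitive regex (at least on this input data).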
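I think your basic version is the fastest one available (sublinear search behavior in good cases, O(n/m), where n is the length of the string and m the length of the term). I would make one small change to it, lowercasing s only once instead of once per term: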
def test():
    lower_s = s.lower()
    return all([term in lower_s for term in terms])
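To make the O(n/m) good-case behavior more concrete, here is a small sketch of the Boyer-Moore-Horspool idea in pure Python. It is only an illustration and not CPython's actual implementation; the function name `horspool_find` and the returned alignment counter are made up for this example. When the character under the pattern's last position does not occur in the pattern, the window jumps ahead by the full pattern length, so only about n/m windows are inspected.

```python
def horspool_find(text, pattern):
    """Return (index, alignments): index of the first match or -1,
    and the number of window positions that were examined."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0

    # For each character of the pattern except the last one, the distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    i = 0
    alignments = 0
    while i <= n - m:
        alignments += 1
        # The window is compared with a slice for brevity; a classic
        # implementation would compare right to left, character by character.
        if text[i:i + m] == pattern:
            return i, alignments
        # Shift by the character aligned with the pattern's last position;
        # characters that do not occur in the pattern allow a full jump of m.
        i += shift.get(text[i + m - 1], m)
    return -1, alignments


# Good case: no character of the text occurs in the pattern,
# so 30 characters are covered with only 30 / 5 = 6 window positions.
print(horspool_find('a' * 30, 'bbbbb'))
```

In the worst case this scheme still degrades to O(n*m), which is one reason production implementations combine several tricks.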

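@ruso, can you explain why it would be the fastest? Basically, I'm trying to learn more about the algorithms behind the scenes and why one approach would be better than another.

Also, you can take a look at the basic classification of search algorithms (single-pattern algorithms).

I think it's the fastest because it is simply implemented in C.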