Optimizing the search for matching substrings between two lists using regular expressions in Python


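Below is my approach: for every element of a list of "phrases", I search for the entries of a list of "words" as substrings, and return the matching substrings found in each phrase.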

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)

# (desired and actual) output
[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]

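Because the "words" list (list_to_search) has about 1,700 entries and the "phrases" list (list_to_be_searched) has about 26,561 entries, the code takes 30 minutes to finish. I don't think my code above is Pythonic or makes use of efficient data structures :(

Can anyone offer some advice on how to optimize it or speed it up?

Thanks.

Actually, I wrote the example above incorrectly. What if the elements of list_to_search consist of several words?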

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)
# (desired and actual) output
[['hello my'],
 ['name', 'is'],
 ['name', 'is'],
 [],
 ['name', 'is', 'is your name', 'your'],
 ['name', 'is']]

Timing the first method:

%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
# 43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Second method (nested list comprehension and re.findall)

The timing did improve, but is there a faster way? Or is the task inherently slow, given what it has to do?
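One possible cost in the original loop is that every call to re.search receives a freshly formatted pattern string, so the re module has to look it up in its internal cache of compiled patterns, which is bounded in size. With roughly 1,700 patterns, many of them may end up being recompiled over and over. The following is a minimal sketch, not a measured fix, that compiles each pattern once up front. The helper names compile_patterns and find_matches are made up for illustration, and the use of re.escape is an added assumption that the search terms are meant literally.

```python
import re

def compile_patterns(to_search):
    # Compile every word or phrase exactly once. re.escape treats the
    # search terms as literal text (an assumption, not in the original).
    return [
        (item, re.compile(r"\b{}\b".format(re.escape(item)), re.IGNORECASE))
        for item in to_search
    ]

def find_matches(to_search, to_be_searched):
    patterns = compile_patterns(to_search)
    # One search per (phrase, pattern) pair, so overlapping and
    # multi-word entries are still reported independently.
    return [
        [item for item, pattern in patterns if pattern.search(text)]
        for text in to_be_searched
    ]

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

print(find_matches(list_to_search, list_to_be_searched))
```

The number of searches is unchanged (words times phrases), so any gain would come only from avoiding repeated pattern construction; it is worth timing on the real data before relying on it.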

You can use a nested list comprehension:

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

[[j for j in list_to_search if j in i.split()] for i in list_to_be_searched]

[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]
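The comprehension above calls i.split() again for every word in list_to_search. A small variation, sketched here under the assumption that every search term is a single whitespace-delimited word and that matching is case-sensitive, splits each phrase once and stores the tokens in a set so each membership test is a hash lookup. The function name match_words is hypothetical.

```python
def match_words(to_search, to_be_searched):
    result = []
    for phrase in to_be_searched:
        tokens = set(phrase.split())  # split each phrase only once
        # Keep the order of to_search, as the list comprehension does.
        result.append([word for word in to_search if word in tokens])
    return result

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

print(match_words(list_to_search, list_to_be_searched))
```

This does not handle multi-word entries such as 'is your name', because splitting on whitespace breaks them apart.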

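Although the most straightforward and clearest approach is a list comprehension, I wanted to see whether regex could do better.

Using a regex on each item in list_to_be_searched did not seem to give any performance gain. But joining list_to_be_searched into one big blob of text and matching it against a regex pattern built from list_to_search gave a slight performance improvement: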

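In [1]: import re
   ...:
   ...: list_to_search = ['my', 'name', 'is', 'you', 'your']
   ...: list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
   ...:
   ...: def simple_method(to_search, to_be_searched):
   ...:     return [[j for j in to_search if j in i.split()] for i in to_be_searched]
   ...:
   ...: def regex_method(to_search, to_be_searched):
   ...:     word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:     blob = '\n'.join(to_be_searched)
   ...:     phrases = word.findall(blob)
   ...:     return [phrase.split(' ') for phrase in ' '.join(phrases).split('\n ')]
   ...:
   ...: def alternate_regex_method(to_search, to_be_searched):
   ...:     word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:     phrases = []
   ...:     for item in to_be_searched:
   ...:         phrases.append(word.findall(item))
   ...:     return phrases
   ...:
In [2]: %timeit -n 100 simple_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.1 µs per loop
In [3]: %timeit -n 100 regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 18.6 µs per loop
In [4]: %timeit -n 100 alternate_regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.4 µs per loop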

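To see how this performs with a large input, I used the 1000 most common English words, one word at a time, as list_to_search, and the full text of David Copperfield from Project Gutenberg, one line at a time, as list_to_be_searched: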

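In [5]: book = open('/tmp/copperfield.txt', 'r+')
In [6]: list_to_be_searched = [line for line in book]
In [7]: len(list_to_be_searched)
Out[7]: 38589
In [8]: words = open('/tmp/words.txt', 'r+')
In [9]: list_to_search = [word for word in words]
In [10]: len(list_to_search)
Out[10]: 1000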
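One detail to watch when building lists this way: iterating over a file object yields lines that still end in a newline character, so each word read from the file carries a trailing '\n' into any pattern or membership test built from it. The sketch below shows one way to load the lines without the line terminator. The helper name load_lines is hypothetical, and the demonstration writes its own small temporary file rather than assuming the files from the session above exist.

```python
import os
import tempfile

def load_lines(path):
    # Read a text file into a list of lines with the trailing
    # newline removed; the file is closed when the block exits.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

# Small self-contained demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "words.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("my\nname\nis\n")
    demo_words = load_lines(demo_path)

print(demo_words)  # ['my', 'name', 'is']
```

Whether this changes the timings reported here is not known; it only ensures the search terms are the bare words.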

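Here are the results:

In [15]: %timeit -n 10 simple_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 31.9 s per loop
In [16]: %timeit -n 10 regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.28 s per loop
In [17]: %timeit -n 10 alternate_regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.43 s per loop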

So, if you care about performance, use one of the regex methods. Hope that helps! :)
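When swapping the list comprehension for a findall-based method, note that the two do not return results in the same shape: findall reports matches in the order they occur in the text and repeats a word each time it occurs, while the comprehension reports each word at most once, in the order of list_to_search. The sketch below illustrates the difference and one way to normalize it. The function name normalized_matches and the use of re.escape are assumptions added for illustration.

```python
import re

list_to_search = ['my', 'name', 'is', 'you', 'your']

# Single alternation over all words; re.escape keeps them literal.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in list_to_search) + r')\b'
)

def normalized_matches(to_search, compiled, text):
    # Collapse duplicates, then report in the order of to_search,
    # which mirrors what the list comprehension returns.
    found = set(compiled.findall(text))
    return [w for w in to_search if w in found]

text = 'john doe doe is last name'
print(pattern.findall(text))                              # ['is', 'name']
print(normalized_matches(list_to_search, pattern, text))  # ['name', 'is']
```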

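Maybe convert list_to_search to a set first, and then use re.findall with \b instead of split.

Thanks for the detailed answer! It's really helpful. But can regex_method also capture entries that consist of multiple words?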
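On the multi-word question, a single alternation can match multi-word entries, but findall only returns non-overlapping matches, so a long phrase consumes the shorter words inside it. The sketch below, with the hypothetical helper build_alternation, sorts the entries longest first so that the longer phrase wins; this ordering and the use of re.escape are assumptions for illustration, not part of the answer's code.

```python
import re

def build_alternation(to_search):
    # Longest entries first, so 'is your name' is tried before 'is'.
    ordered = sorted(to_search, key=len, reverse=True)
    return re.compile(
        r'\b(?:' + '|'.join(re.escape(p) for p in ordered) + r')\b'
    )

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
pattern = build_alternation(list_to_search)

print(pattern.findall('what is your name'))    # ['is your name']
print(pattern.findall('my name is jane doe'))  # ['name', 'is']
```

For 'what is your name' the question's desired output is ['name', 'is', 'is your name', 'your'], which includes overlapping matches. A single findall pass cannot produce that, so reproducing it exactly would still require a separate search per entry, as in the question's loop.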
