How to efficiently search for a list of strings in another list of strings using Python?


python python-3.x string performance search


I have two lists of names (strings), and a list of strings taken from a transcript, like this:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']

analysts = ['Justin Post', 'Some Dude', 'Some Chick']

str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

I need to find the positions at which these names occur in the list of strings (str above).

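The reason I need this is so that I can concatenate the conversation strings together (separated by the names). How would I go about doing this efficiently?

I looked at some similar questions and tried some of their solutions, without success, for example: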

if any(x in str for x in executives):
    print('yes')

and this:

match = next((x for x in executives if x in str), False)
match

I'm not sure whether this is what you want:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

result = [s for s in text if any(ex in s for ex in executives)]
print(result)

Output: ['Brian Olsavsky - Amazon.com']
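
For what it's worth, the two attempts in the question print nothing because `x in some_list` tests whether `x` equals a whole element of the list; it does not look inside each string. Here is a minimal sketch of the difference, using a shortened, made-up `lines` list in place of the question's `str` (the variable names are mine):

```python
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']

# Shortened stand-in for the transcript (hypothetical sample data).
lines = ['Justin Post - Bank of America',
         'Great. Thank you for taking my question.',
         'Brian Olsavsky - Amazon.com',
         'Thank you, Justin.']

# List membership: is any name equal to a whole element of the list?
whole_element_match = any(name in lines for name in executives)

# Substring test: is any name contained in any individual line?
substring_match = any(name in line for line in lines for name in executives)

print(whole_element_match)  # False
print(substring_match)      # True
```

The comprehension in the answer above does the second kind of test, which is why it finds the 'Brian Olsavsky - Amazon.com' line.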

str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]

executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']

Also, if you need the exact positions, you can use:

print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])

This outputs:

[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
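
One thing to keep in mind: `str.index(q)` rescans the list for every match and always returns the first position of a line, which matters if the same line occurs twice. A possible alternative, sketched here under the assumption that the data looks like the lists above, is to carry the index along with `enumerate`. The helper name `find_positions` and the sample data are hypothetical, not part of the answer.

```python
def find_positions(names, lines):
    """Return [name, line_index, char_offset] for every line containing a name."""
    hits = []
    for name in names:
        for line_index, line in enumerate(lines):
            offset = line.find(name)  # -1 when the name is absent
            if offset != -1:
                hits.append([name, line_index, offset])
    return hits


# Made-up sample data; the last line duplicates the second one on purpose.
lines = ['Justin Post - Bank of America',
         'Thank you, Justin.',
         'Brian Olsavsky - Amazon.com',
         'Thank you, Justin.']
names = ['Brian Olsavsky', 'Justin']

positions = find_positions(names, lines)
print(positions)
# [['Brian Olsavsky', 2, 0], ['Justin', 0, 0], ['Justin', 1, 11], ['Justin', 3, 11]]
```

With `list.index` the duplicated line would be reported at index 1 both times; `enumerate` reports 1 and 3.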

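TL;DR: this answer focuses on efficiency. If that is not a key issue, use one of the other answers. If it is, build a dict from the corpus you are searching in, then use this dict to find what you are looking for.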

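Creating an example corpus

First, we create a list of strings in which to search.

Create random words, by which I mean random sequences of characters, with a length drawn from a Poisson distribution, using the following function (it needs the random and string modules, and numpy imported as np):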

def poissonlength_words(lam_word): #generating words, length chosen from a Poisson distrib
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])

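(lam_word is the parameter of the Poisson distribution.)

Let's create number_of_sentences variable-length sentences from these words (by a sentence I mean a list of randomly generated words separated by spaces).

The length of the sentences can also be drawn from a Poisson distribution.

sentences[0] will now start as follows: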

tptt lxnwf iem fedg wbfdq QA aqrys szwx zkmukc
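
The code that builds the sentences is not shown at this point, so here is a rough sketch of how such a corpus could be generated. To keep it free of third-party dependencies, it swaps np.random.poisson for a small standard-library Poisson sampler (Knuth's multiplication method). The names `poisson_sample`, `random_word`, `random_sentence` and `lam_sentence`, as well as all numeric values, are assumptions made for illustration.

```python
import math
import random
import string


def poisson_sample(lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, random.random()
    while product > threshold:
        count += 1
        product *= random.random()
    return count


def random_word(lam_word):
    """Random lowercase string whose length is Poisson(lam_word) distributed."""
    return ''.join(random.choice(string.ascii_lowercase)
                   for _ in range(poisson_sample(lam_word)))


def random_sentence(lam_word, lam_sentence):
    """Space-separated random words; the word count is Poisson(lam_sentence)."""
    return ' '.join(random_word(lam_word)
                    for _ in range(poisson_sample(lam_sentence)))


random.seed(0)                   # fixed seed so the sketch is repeatable
number_of_sentences = 1000       # assumed value
lam_word, lam_sentence = 5, 10   # assumed Poisson parameters
sentences = [random_sentence(lam_word, lam_sentence)
             for _ in range(number_of_sentences)]
print(sentences[0])
```

Note that a Poisson draw can be 0, so empty words (and therefore double spaces) can occur, just as with the original function.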

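Let's also create the names we will search for. Let these names be bigrams. The first name (i.e. the first element of the bigram) will be n characters long, the last name (the second bigram element) will be m characters long, and both will consist of random characters: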

def bigramgen(n,m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])

Task

Suppose we want to find the sentences in which bigrams such as ab c occur. We do not want to find dab c or ab cd, only the places where ab c stands alone.
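
To make the "stands alone" requirement concrete, here is a small sketch that contrasts a plain substring test with a whole-word comparison. The helper `contains_bigram` is a hypothetical name and is not used by the rest of this answer; it only illustrates the matching rule described above.

```python
def contains_bigram(sentence, bigram):
    """True if both words of `bigram` occur as adjacent, whole words in `sentence`."""
    words = sentence.split(' ')
    first, second = bigram.split(' ')
    return any(a == first and b == second for a, b in zip(words, words[1:]))


print('ab c' in 'x dab c y')                  # True: the substring test over-matches
print(contains_bigram('x dab c y', 'ab c'))   # False
print(contains_bigram('x ab cd y', 'ab c'))   # False
print(contains_bigram('x ab c y', 'ab c'))    # True
```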

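To test how fast a method is, let's search for an increasing number of bigrams and measure the elapsed time. The number of bigrams we search for can be, for example: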

number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]

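Brute force method

Just loop through each bigram, loop through each sentence, and use in to find the matches. Meanwhile, measure the elapsed time with time.time().

bruteforcetime will hold the time needed to find 10, 30, 50, ... bigrams.
Warning: this can take a long time for a large number of bigrams.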
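
The brute-force loop itself is not shown here, so the following is only a sketch of what the description above suggests, not the original code. It reuses bigramgen from earlier; the tiny corpus, the helper `random_word`, and the shortened list of sizes are assumptions that keep the example quick to run.

```python
import random
import string
import time


def bigramgen(n, m):
    """Random 'first last' name with n and m lowercase characters."""
    return (''.join(random.choice(string.ascii_lowercase) for _ in range(n))
            + ' '
            + ''.join(random.choice(string.ascii_lowercase) for _ in range(m)))


def random_word(max_len=3):
    """Short random lowercase word (stand-in for the Poisson-length words)."""
    return ''.join(random.choice(string.ascii_lowercase)
                   for _ in range(random.randint(1, max_len)))


random.seed(1)
# Tiny stand-in corpus (assumption): 200 sentences of 8 short random words.
sentences = [' '.join(random_word() for _ in range(8)) for _ in range(200)]

number_of_bigrams_we_search_for = [10, 30, 50]   # shortened for the sketch
bruteforcetime = []
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2, 1) for _ in range(number_of_bigrams)]
    start_time = time.time()
    reslist = []
    for bigram in bigrams:
        for sentencei, sentence in enumerate(sentences):
            if bigram in sentence:               # plain substring test
                reslist.append([bigram, sentencei])
    bruteforcetime.append(time.time() - start_time)

print(bruteforcetime)
```

The work grows with (number of bigrams) x (number of sentences), which is why this becomes slow for large inputs.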

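The sort-your-stuff-to-make-it-faster method

Let's create an empty set for every word that appears in any of the sentences, using a dict comprehension (the worddict = {...} line in the combined listing further down).
To each of these sets, add the index of every sentence in which the word occurs: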
for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)

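Note that we do this only once, no matter how many bigrams we search for afterwards.
Using this dictionary, we can look up the sentences in which each part of a bigram occurs. This is very fast, since it is just a dict lookup. We then take the intersection of these sets. When we search for ab c, we end up with the set of indexes of the sentences in which both ab and c appear: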
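for bigram in bigrams:
    reslist=[]
    # one set of sentence indexes per word of the bigram
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    # sentences that contain every word of the bigram
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        # keep the candidate only if the bigram itself appears in the sentence
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])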
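
To see the index-then-intersect idea on data small enough to check by hand, here is a self-contained sketch. It differs from the snippet above in two ways that are my own choices: it builds worddict with collections.defaultdict, and it uses .get so that a word missing from the corpus yields an empty set instead of raising a KeyError. The `search` helper and the four sample sentences are hypothetical.

```python
from collections import defaultdict

sentences = ['ab c de', 'dab c', 'ab cd', 'x ab c']

# Inverted index: word -> set of indexes of the sentences containing it.
worddict = defaultdict(set)
for sentencei, sentence in enumerate(sentences):
    for word in sentence.split(' '):
        worddict[word].add(sentencei)


def search(bigram):
    """Indexes of sentences containing every word of `bigram` and the bigram itself."""
    # .get avoids a KeyError (and avoids growing the defaultdict) for unseen words.
    setlist = [worddict.get(gram, set()) for gram in bigram.split(' ')]
    candidates = set.intersection(*setlist)
    return sorted(i for i in candidates if bigram in sentences[i])


print(search('ab c'))   # [0, 3]
print(search('zz c'))   # []
```

Only sentences 0 and 3 contain both ab and c as whole words, so 'dab c' and 'ab cd' are never even checked with the substring test.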

Let's put the whole thing together and measure the elapsed time:
logtime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    
    start_time=time.time()
    
    worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)

    for bigram in bigrams:
        reslist=[]
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])

    end_time=time.time()
    
    logtime.append(end_time-start_time)

Warning: this can take a long time for a large number of bigrams, but less than the brute force method.

Results

We can plot how much time each method took:

import matplotlib.pyplot as plt
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

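Or, with the y axis on a log scale:

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

This gives us two plots of elapsed time against the number of bigrams searched, one for each y-axis scale.

Building the worddict dictionary takes a lot of time, which is a disadvantage when searching for only a small number of names. However, there is a point where the corpus is big enough and the number of names we search for is high enough that this cost is compensated by the faster search, compared to the brute force method. So, if these conditions are met, I recommend using this method.

(Notebook available.)

Comments:

Why do you have two lists of names in the first place, when your code only iterates over one of them?

This is a Q&A transcript, so the names are the easiest way to split up the questions and answers. The names also tell me who is asking the question (analysts) and who is answering (executives). I could also put the names into a dictionary if that would make things easier or more efficient.

What is the required output?

Check this answer, hope this helps you.

Ultimately, the desired output would be a new list of strings like this: str = ["questioner", "blah blah", "answerer", "blah blah", "blah blah", "questioner", "blah blah", "blah blah", "answerer", "blah blah"]. The key difference is that each name is followed by only one string, rather than several.

This is perfect! Thank you very much. It exposed a flaw in my reasoning, namely that some names also appear inside the questions, but it does perfectly solve my problem of finding the names in the text. I should be able to make some slight modifications so that it solves the new problem easily.

@RagnarLothbrok, I'm glad it solved the problem for you. Please take another look at the code; I have slightly modified it in the same answer.

This is also very useful. Thank you very much. This may help with some of the operations I need to perform.

@RagnarLothbrok I'm glad I could help. If you take the answers to this question into account, you will hopefully get there faster.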
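
Coming back to the goal stated in the question (joining the conversation strings that sit between two names, so that each name is followed by a single string), one possible approach is sketched below. It assumes that a speaker line always starts with a known name, which also sidesteps the problem of names being mentioned inside an utterance. The function name `merge_by_speaker` and the shortened sample data are hypothetical.

```python
def merge_by_speaker(lines, names):
    """Collapse consecutive utterances into one string per speaker turn."""
    merged, current = [], []
    for line in lines:
        # Treat a line as a speaker header only if it starts with a known name.
        if any(line.startswith(name) for name in names):
            if current:
                merged.append(' '.join(current))
                current = []
            merged.append(line)
        else:
            current.append(line)
    if current:                      # flush the last speaker's utterances
        merged.append(' '.join(current))
    return merged


names = ['Brian Olsavsky', 'Justin Post']
lines = ['Justin Post - Bank of America',
         'Great. Thank you for taking my question.',
         'Could you comment a little bit about that?',
         'Brian Olsavsky - Amazon.com',
         'Thank you, Justin.',
         'In-stock is very strong.']

merged = merge_by_speaker(lines, names)
for item in merged:
    print(item)
```

For a long list of names, the `any(...)` test could be replaced by a lookup keyed on the text before ' - ', in the spirit of the dict-based answer, but that depends on the header format being consistent.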