Alternative to nested loops in Python

python list for-loop nlp

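I have two large lists of text: X = [30,000 records] and Y = [400 entries].

I want to use cosine similarity to find the texts that are similar across the two lists. Below is the code I tried, using nested for loops: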

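from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CountVectorizer()
vectorizer.fit(X + Y)  # shared vocabulary so the vectors are comparable
found_words = []
for x in X:
    for y in Y:
        # each pair is re-vectorized on every iteration, which is the slow part
        vector1 = vectorizer.transform([x.lower()])
        vector2 = vectorizer.transform([y.lower()])
        sim = cosine_similarity(vector1, vector2)[0, 0]
        if sim > 0.9:
            found_words.append(x.capitalize())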

The code above works, but it takes a long time to run. Is there another approach that is efficient in both time and space complexity? Thanks.

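Instead of calling the cosine function pair by pair, you can compute the dot product of normalized vectors, which gives the same value. That lets you vectorize the whole operation.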
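As a small check of that claim, the sketch below compares the dot product of row-normalized matrices with a pair-by-pair cosine computation. The helper names `cosine` and `normalize_rows` and the tiny array sizes are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def normalize_rows(m):
    """Scale every row of a 2-D array to unit length (assumes no all-zero rows)."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

rng = np.random.default_rng(0)
a = rng.random((5, 8))  # stand-in for the X vectors
b = rng.random((3, 8))  # stand-in for the Y vectors

# One matrix product yields every pairwise similarity at once: shape (5, 3)
sims = normalize_rows(a) @ normalize_rows(b).T

# The same values computed one pair at a time, as the nested loops would
pairwise = np.array([[cosine(u, v) for v in b] for u in a])

print(np.allclose(sims, pairwise))
```

The single matrix product replaces both loops, which is where the speed-up comes from.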

Here is my attempt to reproduce the test with random vectors:

import numpy as np 

# assume vector dimension is 100:
a = np.random.random([30000, 100]) # X vectors
b = np.random.random([400, 100]) # Y vectors

a = np.array([[_v/np.linalg.norm(_v)] for _v in a]) # unit-length rows, shape (30000, 1, d)
b = np.array([[_v/np.linalg.norm(_v)] for _v in b]) # unit-length rows, shape (400, 1, d)

sims = np.tensordot(a, b, axes=([1,2], [1,2]))

print(np.where(sims > 0.87)[0]) # index of matched item in X

I lowered the threshold to 0.87 so that the random vectors would produce some matches.

Replace the random a and b with the vectorizing code:

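from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # lowercases the input by default
vectorizer.fit(list(X) + list(Y))  # one shared vocabulary for both lists
a = vectorizer.transform(X).toarray()  # shape (30000, d)
b = vectorizer.transform(Y).toarray()  # shape (400, d)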

Finally, you also need to use the matched indices to retrieve the actual source strings:

x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices)) # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
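Putting the pieces together, here is a minimal end-to-end sketch on toy data. It assumes scikit-learn is installed and swaps the manual normalization for `cosine_similarity` called once on the two sparse count matrices, which avoids converting them to dense arrays first. The function name `find_similar` and the demo strings are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar(X, Y, threshold=0.9):
    """Return the entries of X whose cosine similarity to any entry of Y exceeds threshold."""
    vectorizer = CountVectorizer()          # lowercases the input by default
    vectorizer.fit(list(X) + list(Y))       # one shared vocabulary for both lists
    a = vectorizer.transform(X)             # sparse matrix, shape (len(X), vocab)
    b = vectorizer.transform(Y)             # sparse matrix, shape (len(Y), vocab)
    sims = cosine_similarity(a, b)          # dense array, shape (len(X), len(Y))
    hits = (sims > threshold).any(axis=1)   # True where some y matches this x
    return [x for x, hit in zip(X, hits) if hit]

# Hypothetical demo data
X_demo = ["the cat sat on the mat", "dogs bark loudly", "a completely different sentence"]
Y_demo = ["The cat sat on the mat", "dogs bark"]
print(find_similar(X_demo, Y_demo))
```

Unlike the nested loops, this returns each matching entry of X at most once, in its original order and without capitalizing it.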

If you have access to a CUDA-capable Nvidia GPU, you can use it for faster, parallelized tensor operations. You can use torch (PyTorch) to access the device:

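import torch
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(list(X) + list(Y))  # one shared vocabulary for both lists
a = vectorizer.transform(X).toarray().astype(np.float32)  # shape (30000, d)
b = vectorizer.transform(Y).toarray().astype(np.float32)  # shape (400, d)

# normalize the vectors and also convert them to tensor types
a = torch.tensor(np.array([[_v/np.linalg.norm(_v)] for _v in a]), device='cuda') # shape (30000, 1, d)
b = torch.tensor(np.array([[_v/np.linalg.norm(_v)] for _v in b]), device='cuda') # shape (400, 1, d)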

sims = torch.tensordot(a, b, dims=([1, 2], [1, 2])).cpu().numpy()
# shape (30000, 400)

x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices)) # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
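For machines where a GPU may or may not be present, the same idea can be sketched with a CPU fallback. This is a hedged variant rather than the answer's exact code: it assumes PyTorch is installed, uses `torch.nn.functional.normalize` and a plain matrix product in place of `tensordot`, and the function name `matched_rows` is hypothetical.

```python
import torch
import torch.nn.functional as F

def matched_rows(a, b, threshold=0.9):
    """Return the row indices of a whose cosine similarity to any row of b exceeds threshold."""
    # Fall back to the CPU when no CUDA device is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = F.normalize(torch.as_tensor(a, dtype=torch.float32, device=device), dim=1)
    b = F.normalize(torch.as_tensor(b, dtype=torch.float32, device=device), dim=1)
    sims = a @ b.T                          # shape (rows of a, rows of b)
    hits = (sims > threshold).any(dim=1)    # one boolean per row of a
    return torch.nonzero(hits).flatten().cpu().tolist()

# Tiny hand-made vectors: only the first row points the same way as the query
print(matched_rows([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[1.0, 0.0]]))
```

The returned indices can be used to look up the source strings, for example `[X[i] for i in matched_rows(a, b)]`.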

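There is a similar answer here.

I did see that, but because my lists are huge, it takes the same amount of time to run as the nested for loops.

What happens when you move

vector1 = vectorizer(x.lower())

to before the line

for y in Y:

?

OK, it looks like you need multiprocessing. In any case, you should call s.lower() only once per string, so you may want to run, for example,

Y = [y.lower() for y in Y]

before the loop.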