Speeding up corpus document similarity calculation in Python
My input is a string in (spintax) format, and I use itertools to generate all possible combinations. Among these generated strings, I only want to keep the most unique ones based on a similarity threshold, for example keeping only strings that are less than 60% similar to each other. I used difflib's SequenceMatcher, but because of the nested loops it does not scale to large datasets (250K+ items). This is the current implementation:
from difflib import SequenceMatcher

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  # 40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions
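For reference, SequenceMatcher.ratio() returns 2*M/T, where M is the number of matching characters and T is the combined length of both strings. A minimal sketch of how the threshold behaves (the example strings are made up):

```python
from difflib import SequenceMatcher

# Hypothetical example strings. The two share the 16-character prefix
# "the quick brown ", so ratio() = 2*16 / (19+19) ~= 0.842.
a = "the quick brown fox"
b = "the quick brown cat"
r = SequenceMatcher(None, a, b).ratio()
print(r)  # ~0.842, above a 0.6 threshold, so b would be dropped
```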
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)
val = cosine_similarity(tfidf_matrix[:10000], tfidf_matrix[:10000])
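One way to turn the cosine-similarity matrix from the snippet above into a filter is a greedy pass: keep a string only if its similarity to every already-kept string is at or below the threshold. A sketch using plain Python and toy 2-D vectors in place of the real tf-idf rows (with the real data you would read similarities off the tfidf_matrix rows instead):

```python
import math

def cosine(u, v):
    # Cosine similarity of two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_filter(vectors, max_similar=0.6):
    """Greedily keep the indices whose cosine similarity to every
    already-kept vector stays at or below the threshold."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[k]) <= max_similar for k in kept):
            kept.append(i)
    return kept

# Toy vectors standing in for tf-idf rows (hypothetical data).
vecs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(greedy_filter(vecs))  # [0, 2]: vector 1 is ~0.995 similar to vector 0
```

This is still O(n * kept) similarity checks, but each check is a cheap vector operation rather than a SequenceMatcher alignment, and nothing is deleted from the list.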
Is there a more optimized solution? I only want to pick the n most unique strings from the list. One thing that can be optimized is the use of
del
. It is currently executed many times, and although I am not sure exactly how Python handles this internally, I believe a solution with a single del statement is better, since every del of a middle element forces Python to shift all the elements behind it.
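For the simple numeric case, the repeated-del pattern can also be replaced with a single list comprehension, which builds the list of survivors in one pass (the names here mirror the test functions below):

```python
def filter_comprehension(long_list, max_num):
    # One pass, one new list; no per-element del needed.
    return [x for x in long_list if x <= max_num]

print(filter_comprehension([4, 1, 7, 2, 9], 5))  # [4, 1, 2]
```

This trades the in-place mutation for a second list, but keeps the original order and avoids the O(n) shift per deletion.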
So I decided to test this approach:
import time
import argparse

def test1(long_list, max_num):
    """
    Removes values from the list with a delete at every step of the loop.
    """
    i = 0
    while i < len(long_list):
        if long_list[i] > max_num:
            del long_list[i]
        else:
            i += 1
    return long_list

def test2(long_list, max_num):
    """
    Removes values by swapping them to the back of the list (marked as
    garbage), then deleting them all with a single slice delete at the end.
    """
    garbage_index = len(long_list) - 1
    i = 0
    while i <= garbage_index:
        if long_list[i] > max_num:
            long_list[i], long_list[garbage_index] = long_list[garbage_index], long_list[i]
            garbage_index -= 1
        else:
            i += 1
    del long_list[garbage_index + 1:]
    return long_list

def get_args():
    """
    Fetches the arguments needed for test1() and test2().
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("list_size", help="Set the size of the list.", type=int)
    parser.add_argument("max_element", help="Set max-element value.", type=int)
    return parser.parse_args()

if __name__ == '__main__':
    # Times the two test functions and prints the time difference.
    # Note: test1 mutates long_list in place, so test2 then runs on the
    # already-filtered list and mostly just scans the surviving elements.
    args = get_args()
    long_list = [x for x in range(args.list_size)]
    print("Using list size {}".format(args.list_size))
    start = time.time()
    test1(long_list, args.max_element)
    end1 = time.time()
    test2(long_list, args.max_element)
    end2 = time.time()
    print("test1:", end1 - start)
    print("test2:", end2 - end1)
The test2() solution also does not create a new garbage list; it swaps elements in place within the same list, saving both space and time.
Hopefully this helps towards a more optimized algorithm.

This is not a solution, but I think you have a bug: you should not increment j after deleting descriptions[j].
Updated the code; that bug was fixed a few hours ago :)
I may be missing something, but after finding a similar description, why start over? Why not leave j as it is after deleting descriptions[j]?
Fixed it... thanks for spotting that, no wonder it was taking so long :) But I believe there is still a better way... somewhere.
Benchmark results:
$ python3 Code/Playground/stackoverflow/pyspeedup.py 10 5
Using list size 10
test1: 4.5299530029296875e-06
test2: 2.384185791015625e-06
$ python3 Code/Playground/stackoverflow/pyspeedup.py 100 50
Using list size 100
test1: 1.71661376953125e-05
test2: 5.9604644775390625e-06
$ python3 Code/Playground/stackoverflow/pyspeedup.py 1000 500
Using list size 1000
test1: 0.00022935867309570312
test2: 4.506111145019531e-05
$ python3 Code/Playground/stackoverflow/pyspeedup.py 10000 5000
Using list size 10000
test1: 0.006038665771484375
test2: 0.00046563148498535156
$ python3 Code/Playground/stackoverflow/pyspeedup.py 100000 5000
Using list size 100000
test1: 2.022616386413574
test2: 0.0004937648773193359
$ python3 Code/Playground/stackoverflow/pyspeedup.py 1000000 5000
Using list size 1000000
test1: 224.23923707008362
test2: 0.0005621910095214844
$ python3 Code/Playground/stackoverflow/pyspeedup.py 10000000 5000
Using list size 10000000
test1: 43293.87373256683
test2: 0.0005309581756591797