NLTK similarity performance issue?

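NLTK has a nice word-to-word similarity function that measures similarity by how close two terms are to a common hypernym. Although the similarity function does not work when the two terms have different POS tags, it is still useful.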
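To make the "closeness to a common hypernym" idea concrete, here is a minimal sketch of the Wu-Palmer score, 2 * depth(lcs) / (depth(a) + depth(b)), on a tiny hand-made taxonomy. The `PARENT` table and the helper names are invented for illustration only. NLTK's `wup_similarity` works on WordNet synsets and deals with details this toy ignores, such as multiple hypernym paths and separate noun and verb hierarchies.

```python
# Toy taxonomy (hypothetical, NOT WordNet data): child -> parent, root maps to None.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def toy_wup(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a
    # is the deepest shared ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(toy_wup("dog", "cat"))  # share "animal", so the score is higher
print(toy_wup("dog", "car"))  # share only the root "entity", so it is lower
```

Every call has to walk hypernym paths for both terms, which hints at why a pairwise loop over many words gets expensive.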

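However, I find it too slow: it is about 10 times slower than plain term matching. Is there any way to make the NLTK similarity function faster?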

I tested it with the following code:

from nltk import stem, RegexpStemmer
from nltk.corpus import wordnet, stopwords
from nltk.tag import pos_tag
import time

file1 = open('./tester.csv', 'r')

def similarityCal(word1, word2):
  synset1 = wordnet.synsets(word1)
  synset2 = wordnet.synsets(word2)
  if len(synset1) != 0 and len(synset2) != 0:
    wordFromList1 = synset1[0]
    wordFromList2 = synset2[0]
    return wordFromList1.wup_similarity(wordFromList2)
  else:
    return 0


start_time = time.time()
file1lines = file1.readlines()

stopwords = stopwords.words('english')
previousLine = ""
currentLine = ""
cntOri = 0
cntExp = 0

for line1 in file1lines:  
  currentLine = line1.lower().strip()
  if previousLine == "":
    previousLine = currentLine
    continue
  else:
    for tag1 in pos_tag(currentLine.split(" ")):
      tmpStr1 = tag1[0]
      if tmpStr1 not in stopwords and len(tmpStr1) > 1:
        if tmpStr1 in previousLine:
          print("termMatching word", tmpStr1)
          cntOri = cntOri + 1
      for tag2 in pos_tag(previousLine.split(" ")):
        tmpStr2 = tag2[0]
        if tag1[1].startswith("NN") and tag2[1].startswith("NN") or tag1[1].startswith("VB") and tag2[1].startswith("VB"):
          value = similarityCal(tmpStr1, tmpStr2)
          if type(value) is float and value > 0.8:
            print(tmpStr1, " similar to " , tmpStr2 , " ", value)
            cntExp = cntExp + 1
    previousLine = currentLine

end_time = time.time()
print("time taken : ", end_time - start_time, " // ", cntOri, " | ", cntExp)

file1.close()

To compare performance, I simply commented out the similarity function call.

I used samples from this site:

Any ideas?
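One general way to cut the cost of a repeated pairwise lookup is memoization, since consecutive lines of text tend to repeat the same word pairs. The sketch below is not a confirmed fix for the code above. `slow_similarity` is a hypothetical stand-in for `similarityCal`, and `count_similar` is an illustrative helper, so the example runs without WordNet.

```python
from functools import lru_cache

call_log = []  # records every real (uncached) computation


def slow_similarity(word1, word2):
    """Hypothetical stand-in for the expensive lookup (e.g. similarityCal)."""
    call_log.append((word1, word2))
    return 1.0 if word1 == word2 else 0.0


@lru_cache(maxsize=None)
def cached_similarity(word1, word2):
    # Each distinct (word1, word2) pair is computed once, then served from the cache.
    return slow_similarity(word1, word2)


def count_similar(previous_words, current_words, threshold=0.8):
    """Count word pairs across two lines whose score exceeds the threshold."""
    hits = 0
    for w1 in current_words:
        for w2 in previous_words:
            if cached_similarity(w1, w2) > threshold:
                hits += 1
    return hits


lines = [["the", "dog", "runs"], ["the", "dog", "sleeps"], ["the", "dog", "runs"]]
total = sum(count_similar(prev, cur) for prev, cur in zip(lines, lines[1:]))
print("similar pairs:", total)
print("real computations:", len(call_log))
print(cached_similarity.cache_info())
```

Separately, the loop in the question calls `pos_tag(previousLine.split(" "))` once for every word of the current line. Tagging each line once and reusing the result would likely remove some overhead as well, although the question attributes most of the slowdown to the similarity call itself.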

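I have a similar similarity implementation, and it is also slow. Thanks.

Well, maybe the similarity function is inevitably less practical for big-data use... Commercial sites such as [link] seem to work well with a thesaurus. Maybe indexing all the related words would help.

But beware: spamming thesarus.com leads to anger, anger leads to hate, hate leads to suffering...

Hehe, I didn't mean to harass a commercial site. Thanks, Yoda, though!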
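The idea of indexing all related words, mentioned in the comments, could look roughly like the sketch below: pay the pairwise cost once over a known vocabulary, then answer "is this word related to that one?" with a set lookup. `build_related_index` and `toy_similarity` are hypothetical names, the toy scoring rule is a placeholder for a real similarity function, and the code assumes the score is symmetric.

```python
from collections import defaultdict
from itertools import combinations


def build_related_index(vocabulary, similarity, threshold=0.8):
    """Map each word to the set of other words scoring above the threshold."""
    index = defaultdict(set)
    for w1, w2 in combinations(sorted(set(vocabulary)), 2):
        score = similarity(w1, w2)
        # Some similarity functions return None when no score is available.
        if score is not None and score > threshold:
            # Assumption: similarity is symmetric, so store both directions.
            index[w1].add(w2)
            index[w2].add(w1)
    return index


def toy_similarity(w1, w2):
    # Hypothetical placeholder: words sharing their first three letters count as similar.
    return 1.0 if w1[:3] == w2[:3] else 0.0


index = build_related_index(
    ["run", "running", "walk", "walked", "dog"], toy_similarity
)
print(sorted(index["run"]))
print("walked" in index["walk"])
```

Building the index is still quadratic in the vocabulary size, so this only pays off when the same vocabulary is queried many times.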