Python document similarity: comparing two documents efficiently


python mysql performance


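I have a loop that computes the similarity between two documents. It collects all the tokens in each document together with their scores and puts them into a dictionary. It then compares the dictionaries.

This is what I have so far. It works, but it is very slow: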

# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
#convert tuple to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)

#Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
#convert tuple to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)

# loop through each token in doca and see if one matches in docb
for x in doca_dic:
    if docb_dic.has_key(x):
        #calculate the similarity by summing the products of the tf-idf_norm 
        similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity

I'm still fairly new to Python, so this is a bit of a mess. I need to speed it up, and any help would be greatly appreciated.

Thanks.

A Python point:

adict.has_key(k)

is obsolete in Python 2.X and gone in Python 3.X. The expression k in adict has been available since Python 2.2; use it instead. It will be faster (no method call).

A practical point in any language: iterate over the shorter dictionary.

Combined result:

if len(doca_dic) < len(docb_dic):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        #calculate the similarity by summing the products of the tf-idf_norm 
        similarity += short_dict[x] * long_dict[x]

An alternative to the above code: this does more work, but it does more of the iterating in C rather than in Python, and may be faster.

similarity = sum(
    doca_dic[k] * docb_dic[k]
    for k in set(doca_dic) & set(docb_dic)
    )
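
For readers on Python 3, the same set-intersection idea can be sketched without building two new sets, because dictionary key views already support the & operator. The function name dot_similarity and the sample scores below are made up for illustration; they are not from the original post.

```python
def dot_similarity(doc_a, doc_b):
    """Sum the products of the scores for tokens present in both dicts."""
    # Key views are set-like in Python 3, so & yields the shared tokens.
    shared = doc_a.keys() & doc_b.keys()
    return sum(doc_a[token] * doc_b[token] for token in shared)


# Hypothetical sample data: token -> normalised tf-idf score.
doc_a = {"apple": 0.5, "word": 0.25, "zulu": 0.75}
doc_b = {"apple": 0.5, "word": 0.5, "zealot": 0.25}

print(dot_similarity(doc_a, doc_b))
```

Whether this beats the explicit loop depends on dictionary sizes and the interpreter version, so it is worth timing both on real data.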

Final version of the Python code:

# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
if len(doca) < len(docb):
    short_doc, long_doc = doca, docb
else:
    short_doc, long_doc = docb, doca
long_dict = dict(long_doc) # yes, it should be that simple
similarity = 0
for key, value in short_doc:
    if key in long_dict:
        similarity += long_dict[key] * value
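
If only one dictionary is wanted, the rows of the second query can be consumed directly from the cursor as they arrive instead of calling fetchall(). The sketch below assumes a driver whose cursors are iterable (an optional DB-API extension that sqlite3 provides). It uses an in-memory SQLite database as a stand-in for MySQL, with a hypothetical table name doc_index and invented scores, and it gives up the "shorter document first" optimisation in exchange for not materialising document B.

```python
import sqlite3


def similarity_from_cursors(cursor_a, cursor_b):
    """Dot product of two (token, score) result sets.

    Only document A is turned into a dict; rows of document B are
    processed one at a time as the cursor yields them.
    """
    scores_a = dict(cursor_a)        # each row is a (token, score) pair
    total = 0.0
    for token, score in cursor_b:    # iterate rows as they come out
        if token in scores_a:
            total += scores_a[token] * score
    return total


# Stand-in data (assumption): not the asker's real table or values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_index (doc_id INTEGER, token TEXT, tfidf_norm REAL)"
)
conn.executemany(
    "INSERT INTO doc_index VALUES (?, ?, ?)",
    [
        (1, "apple", 0.5), (1, "word", 0.25), (1, "zulu", 0.75),
        (2, "apple", 0.5), (2, "word", 0.5), (2, "zealot", 0.25),
    ],
)

query = "SELECT token, tfidf_norm FROM doc_index WHERE doc_id = ?"
result = similarity_from_cursors(
    conn.execute(query, (1,)), conn.execute(query, (2,))
)
print(result)
```

Note that sqlite3 uses ? as its parameter placeholder, while the MySQL driver in the question uses %s.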

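It is worth checking that the database table is appropriately indexed (for example, an index on token by itself). Not having a usable index is a good way to make an SQL query run very slowly.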

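Explanation: having an index on token may make your existing queries, or the "do all the work in the DB" query, or both, run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If there is no usable index, the DB will read ALL the rows in the table, which is not good.
Creating an index: create index atable_token_idx on atable(token)
Dropping an index: drop index atable_token_idx
(but do consult the docs for your DB)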
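
One way to see whether an index is actually picked up is to ask the database for its query plan. The sketch below does this with Python's built-in sqlite3 module purely as a stand-in; MySQL has its own EXPLAIN statement with different output, and the table layout here (atable with doc_id, token, tfidf_norm) is an assumption based on the statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atable (doc_id INTEGER, token TEXT, tfidf_norm REAL)"
)

# Create the index exactly as in the statement quoted above.
conn.execute("CREATE INDEX atable_token_idx ON atable (token)")

# List the indexes SQLite knows about for this table (column 1 is the name).
index_names = [row[1] for row in conn.execute("PRAGMA index_list(atable)")]
print(index_names)

# Ask SQLite how it would run a lookup by token; the last column of each
# plan row is a human-readable description of that step.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT tfidf_norm FROM atable WHERE token = ?",
    ("apple",),
).fetchall()
plan_text = " ".join(row[-1] for row in plan_rows)
print(plan_text)

# Drop the index again and confirm it is gone.
conn.execute("DROP INDEX atable_token_idx")
remaining = [row[1] for row in conn.execute("PRAGMA index_list(atable)")]
```

Since the queries in this thread filter on doc_id, it may also be worth checking how an index that covers doc_id affects the plan on the real database.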
How about pushing some of the work onto the database?
With a join, you can get a result that looks basically like this:
    Token    A.tfidf_norm B.tfidf_norm
-----------------------------------------
    Apple      12.2          11.00
       ...
    Word       29.87         33.21
    Zealot      0.00         11.56
    Zulu       78.56          0.00

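You then just scan the cursor and do your computation.
If you don't need to know whether a word is in one document and missing from the other, you don't need an outer join, and the list will be the intersection of the two sets. The example I included above automatically assigns a "0" to words missing from one of the two documents. See what your "matching" function requires.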
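
The inner-join variant described above can be sketched as follows. For the dot-product similarity only tokens present in both documents contribute, so an inner join is enough. The sketch uses an in-memory SQLite database as a stand-in for MySQL; the table name doc_index and the scores are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_index (doc_id INTEGER, token TEXT, tfidf_norm REAL)"
)
conn.executemany(
    "INSERT INTO doc_index VALUES (?, ?, ?)",
    [
        (1, "apple", 0.5), (1, "word", 0.25), (1, "zulu", 0.75),
        (2, "apple", 0.5), (2, "word", 0.5), (2, "zealot", 0.25),
    ],
)

# Self-join on token: one output row per token shared by both documents.
join_sql = """
    SELECT a.token, a.tfidf_norm, b.tfidf_norm
    FROM doc_index AS a
    JOIN doc_index AS b ON a.token = b.token
    WHERE a.doc_id = ? AND b.doc_id = ?
    ORDER BY a.token
"""

similarity = 0.0
shared_tokens = []
# Scan the cursor and accumulate the products row by row.
for token, score_a, score_b in conn.execute(join_sql, (1, 2)):
    shared_tokens.append(token)
    similarity += score_a * score_b

print(shared_tokens, similarity)
```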
A single SQL query can do the whole job:
SELECT sum(index1.tfidf_norm*index2.tfidf_norm) FROM index index1, index index2 WHERE index1.token=index2.token AND index1.doc_id=? AND index2.doc_id=?

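Just replace the two "?" placeholders with the two document ids.

Comments:

"If you don't need the two dictionaries for anything else, you could create only the A one and iterate over B (key, value) tuples as they pop out of your B query." How would I iterate over them as they come out of my query? (Sorry if this is obvious.) Thanks very much for your help. I must say this site and the people on it are great.

Wow, thanks. One question: "It is worth checking that the database table is appropriately indexed (e.g. one by itself on token)". I don't understand this. Sorry, this is probably very basic.

@seanieb: Once again, please see my updated answer; it also has a better answer to the "if you don't need the dictionaries" question.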
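
As a rough end-to-end check of the single-query approach, the aggregate can be wrapped in a small helper. This is only a sketch under assumptions: it runs against an in-memory SQLite database rather than MySQL, the table is named doc_index (hypothetical), the helper name db_similarity is invented, and the scores are made up. SUM returns NULL when no rows match, so the helper maps that to 0.0.

```python
import sqlite3


def db_similarity(conn, doc_a_id, doc_b_id):
    """Let the database sum the score products over shared tokens."""
    sql = """
        SELECT SUM(a.tfidf_norm * b.tfidf_norm)
        FROM doc_index AS a
        JOIN doc_index AS b ON a.token = b.token
        WHERE a.doc_id = ? AND b.doc_id = ?
    """
    (total,) = conn.execute(sql, (doc_a_id, doc_b_id)).fetchone()
    # SUM over zero rows yields NULL, which arrives in Python as None.
    return total if total is not None else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_index (doc_id INTEGER, token TEXT, tfidf_norm REAL)"
)
conn.executemany(
    "INSERT INTO doc_index VALUES (?, ?, ?)",
    [
        (1, "apple", 0.5), (1, "word", 0.25), (1, "zulu", 0.75),
        (2, "apple", 0.5), (2, "word", 0.5), (2, "zealot", 0.25),
    ],
)

print(db_similarity(conn, 1, 2))
```

With this approach only one number crosses the connection per document pair, which is usually where the time goes when the Python loop fetches every row.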