Python 2.7: Accessing given tf-idf weights for an information retrieval course using Python


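I am working on this Python program and need to access the given tf-idf weights. This is what I am trying to implement in code: return a dict mapping doc_id to length, computed as sqrt(sum(w_i**2)), where w_i is the tf-idf weight for each term in the document. E.g., in the sample index below, document 0 has two terms, 'a' (with tf-idf weight 3) and 'b' (with tf-idf weight 4). Its length is therefore 5 = sqrt(9 + 16).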

    >>> lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0,4]]})
    >>> lengths[0]
    5.0
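As a quick sanity check of the formula, the arithmetic for document 0 can be reproduced directly. This is only a minimal sketch: the list name `weights` is an illustrative assumption, and the values are the two tf-idf weights from the sample index above.

```python
from math import sqrt

# tf-idf weights of the terms 'a' and 'b' in document 0 (from the sample index)
weights = [3, 4]

# length = square root of the sum of squared weights
length = sqrt(sum(w ** 2 for w in weights))
print(length)  # 5.0
```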

My code is:

    templar = []
    for iter in index.values():
        templar ...
    d = defaultdict(list)
    for i, l in templar[1]:
        d[i].append(...)
    lent = defaultdict(...)
    for m in d:
        lo = math.sqrt(sum(lent[m]**2))
    return lo

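So, if I understand correctly, we have to transform the input dictionary:

ind = {'a':[ [1,3] ], 'b': [ [1,4 ] ] }

into the output dictionary:

{1:5}

where 5 is computed as the Euclidean norm of the value part of the input dictionary, in this case the vector [3,4]. Is that correct?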

With that information, the answer becomes more straightforward:

from math import sqrt

def calculate_length(ind):
    # First, let's transform the dictionary into a list of [doc_id, tl_idf] pairs: [[doc_id_1, tl_idf_1], ...]
    data = [entry[0] for entry in ind.itervalues()]  # use just ind.values() in Python 3.x
    # Next, let's split that list into two, one for doc_ids, one for tl_idfs
    doc_ids, tl_idfs = zip(*data)
    # We just assume that all the doc_ids are the same. You could check that here if you wanted
    doc_id = doc_ids[0]
    # Next, we calculate the length as per our formula
    length = sqrt(sum(tl_idf**2 for tl_idf in tl_idfs))
    # Finally, we return the output dictionary
    return {doc_id: length}

For example:

>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}

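There are a few places where you could optimize this to remove the intermediate lists. The method could be two lines of operations plus a return, but I will leave that for you to figure out, since this is homework. I also hope you take the time to really understand what this code does rather than copying it wholesale.

Also note that this answer makes a very big assumption: that all the doc_id values are the same, and that there is only ever a single [doc_id, tl_idf] list on each key of the dictionary! If that is not the case, the transformation becomes more complex. Nothing in your sample input or explanation indicates that the more complex case applies, but based on the data structure I suspect it probably does.

Update: In fact, this really bothered me, because I think the more complex case does apply. Here is a version that handles it: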
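One way to test that assumption before relying on the simple version is sketched below. The helper name `has_single_doc` is hypothetical (it is not part of the original answer), and it uses `values()` so that it runs on both Python 2.7 and Python 3.

```python
def has_single_doc(ind):
    """Return True if every term has exactly one [doc_id, weight] pair
    and all of those pairs refer to the same doc_id."""
    doc_ids = set()
    for postings in ind.values():
        # More (or fewer) than one pair per term breaks the assumption
        if len(postings) != 1:
            return False
        doc_ids.add(postings[0][0])
    # Exactly one distinct doc_id must remain
    return len(doc_ids) == 1


print(has_single_doc({'a': [[1, 3]], 'b': [[1, 4]]}))                  # True
print(has_single_doc({'a': [[1, 3], [2, 3]], 'b': [[1, 4], [2, 1]]}))  # False
```

If the check returns False, the simple version would silently give a wrong result, so a per-document grouping is needed instead.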

from itertools import chain
from collections import defaultdict
from math import sqrt

def calculate_length(ind):
    # We want to transform this first into a dict of {doc_id: [tl_idf_a, ...]}
    # First we transform it into a generator of ([doc_id, tl_idf], ...)
    tf_gen  = chain.from_iterable(ind.itervalues())
    # which we then use to generate our transformed dictionary
    tf_dict = defaultdict(list)
    for doc_id, tl_idf in tf_gen:
        tf_dict[doc_id].append(tl_idf)
    # Now we proceed mostly as before, but we can just do it in one line
    return dict((doc_id, sqrt(sum(tl_idf**2 for tl_idf in tl_idfs))) for doc_id, tl_idfs in tf_dict.iteritems())

Example usage:

>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}
>>> calculate_length({'a':[ [1,3],[2,3] ], 'b': [ [1,4 ], [2,1] ] })
{1: 5.0, 2: 3.1622776601683795}

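Can you clarify your intent? As it stands, your data structure is a bit odd; you have a two-element list contained in another list that has no other elements. Why not just have a single list? Can there be multiple [doc_id, value] sets per dictionary key? What operation are you trying to perform over the whole collection? In other words, describe the problem itself, not just your implementation of a solution to it. Unfortunately, your implementation is not clear enough to serve as a description of the problem.

The professor told me how to use it. This is what I am trying to implement in code: return a dict mapping doc_id to length, computed as sqrt(sum(w_i**2)), where w_i is the tf-idf weight for each term in the document. E.g., in the sample index below, document 0 has two terms, 'a' (with tf-idf weight 3) and 'b' (with tf-idf weight 4). Its length is therefore 5 = sqrt(9 + 16).

    >>> lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
    >>> lengths[0]
    5.0

OK, that helps. Please edit that comment into your original question with proper formatting so that everyone can see it and it is easier to read.

Only the output 5.0 is needed, not {1:5.0}, and your explanation makes sense. I will work from that. Thank you.

@sjain Also, if this solved your problem, please accept the answer (click the grey check mark next to the answer) so that others know it solved the problem.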
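For the interface quoted in the comments, where `lengths[0]` should yield 5.0, the per-document grouping can be placed inside a method. The sketch below is only an illustration: the `Index` class here is a hypothetical stand-in for the course's class, whose real definition is not shown in the question, and the code avoids `itervalues()` and `iteritems()` so that it runs on both Python 2.7 and Python 3.

```python
from collections import defaultdict
from math import sqrt


class Index(object):
    """Hypothetical stand-in for the course's Index class."""

    def compute_doc_lengths(self, index):
        # Accumulate the sum of squared tf-idf weights per document
        squared_sums = defaultdict(float)
        for postings in index.values():
            for doc_id, weight in postings:
                squared_sums[doc_id] += weight ** 2
        # The length of each document is the square root of its sum
        return dict((doc_id, sqrt(total))
                    for doc_id, total in squared_sums.items())


lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
print(lengths[0])  # 5.0
```

Returning the whole dict and indexing it with the doc_id gives the bare 5.0 the asker wants, without changing the computation.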