Creating a word index in Python


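I'm currently trying to create a word index: read each line from a text file and check whether each word is in that line. If it is, print the line number and keep checking. I've got it working the way I want when I print each word and line number, but I'm not sure what kind of storage I can use to hold every number.

Code example: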

def index(filename, wordList):
    'string, list(string) ==> string & int, returns an index of words with the line number\
    each word occurs in'
    indexDict = {}
    res = []
    infile = open(filename, 'r')
    count = 0
    line = infile.readline()
    while line != '':
        count += 1
        for word in wordList:
            if word in line:
                #indexDict[word] = [count]
                print(word, count)
        line = infile.readline()
    #return indexDict

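# Simple illustration with numbers: append each item to the list stored under its matching key.
# The names avoid shadowing the built-ins dict and list.
groups = {1: [], 2: [], 3: []}

items = [1, 2, 2, 2, 3, 3]

for k in groups:
    for i in items:
        if i == k:
            groups[k].append(i)


In [7]: groups
Out[7]: {1: [1], 2: [2, 2, 2], 3: [3, 3]}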

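This prints the word and the current count (the line number), but what I want is to store the numbers so that later I can print them out as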

word linenumber

word2 linenumber, linenumber

and so on. I feel a dictionary could do this if I put each line number in a list, so that each key can hold multiple values, but the closest I've got is:

{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [120], 'evil': [106], 'demon': [122]}

when I want it to look like this:

{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [44, 53, 55, 64, 78, 97, 104, 111, 118, 120], 'evil': [99, 106], 'demon': [122]}

Any ideas?

Try something like this:

import collections
def index(filename, wordList):
    indexDict = collections.defaultdict(list)
    with open(filename) as infile:
        for (i, line) in enumerate(infile.readlines()):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i+1)
    return indexDict

This produces exactly the same result as in your example (using Poe's "The Raven").
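To get the "word linenumber, linenumber" layout asked for in the question, the returned dictionary can be formatted afterwards. This is only a minimal sketch: `format_index` and the `sample` data are hypothetical names introduced here for illustration, not part of the answer.

```python
def format_index(index_dict):
    """Return one 'word n1, n2, ...' line per word, sorted alphabetically."""
    rows = []
    for word in sorted(index_dict):
        # Join the stored line numbers with ", " after the word.
        numbers = ", ".join(str(n) for n in index_dict[word])
        rows.append(f"{word} {numbers}")
    return "\n".join(rows)


# Hypothetical sample data in the same shape the index function returns.
sample = {"raven": [44, 53, 55], "evil": [99, 106], "demon": [122]}
print(format_index(sample))
```

Because the function only reads the mapping, it should work the same way with a plain `dict` or a `defaultdict`.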

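Alternatively, you could consider using a plain `dict` instead of a `defaultdict` and initializing it with all the words in the list; this ensures that `indexDict` contains an entry for every word, even if that word does not appear in the text.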
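The plain-`dict` variant might look roughly like the sketch below. To keep it self-contained it takes an iterable of lines instead of a filename (an open file object would also work); `index_lines` and the sample text are assumptions made for illustration.

```python
def index_lines(lines, word_list):
    """Map every word in word_list to the 1-based numbers of the lines containing it."""
    # Pre-create an empty list per word so words that never occur still get an entry.
    index_dict = {word: [] for word in word_list}
    for line_number, line in enumerate(lines, 1):
        for word in word_list:
            if word in line:  # substring test, as in the code above
                index_dict[word].append(line_number)
    return index_dict


# Hypothetical sample text; "demon" never occurs but still appears in the result.
text = ["the raven sat", "nothing here", "raven and ghost"]
result = index_lines(text, ["raven", "ghost", "demon"])
print(result)
```

The trade-off is that a `defaultdict` only holds words that were actually found, while this version always reports every requested word.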

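Also, note the use of `enumerate`. This built-in function is very useful for iterating over both the index and the item at that index of some sequence (such as the lines in a file).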
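As a small illustration of `enumerate`, the sketch below numbers a hypothetical list of lines twice: once with the default start of 0 and once with the optional start value of 1, which lines up with human-readable line numbers.

```python
# Hypothetical stand-in for the lines of a file.
lines = ["first line\n", "second line\n", "third line\n"]

# Default: numbering starts at 0.
zero_based = [(i, line.strip()) for i, line in enumerate(lines)]

# Optional second argument: numbering starts at 1.
one_based = [(i, line.strip()) for i, line in enumerate(lines, 1)]

print(zero_based)
print(one_based)
```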

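There may be a more Pythonic way to write this, but for readability you could try the approach shown in the simple numeric example above, where each item is appended to the list stored under its matching key.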

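If the list already exists, you need to append the next item to it.

The easiest way to have the list already exist, even the first time a word is found, is to use a `collections.defaultdict` object to track the word-to-lines mapping: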

from collections import defaultdict

def index(filename, wordList):
    indexDict = defaultdict(list)
    with open(filename, 'r') as infile:
        for i, line in enumerate(infile, 1):  # start at 1 so i is a 1-based line number
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
                    print(word, i)

    return indexDict

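I simplified your code using best practices: the file is opened as a context manager so that it is closed automatically when you are done, and `enumerate()` creates the line numbers on the fly.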

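If you convert each line to a set of words (`set(line.split())`, although this will not remove punctuation), you could speed this up further (and make it more accurate), because you could then use a set intersection test against `wordList` (also converted to a set), which may find the matching words faster.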
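One way the set-intersection idea could be sketched is shown below. The punctuation stripping and lowercasing are assumptions added here (the answer only mentions `set(line.split())`), the words in `word_list` are assumed to be lowercase, and `index_by_sets` is a hypothetical name.

```python
import string


def index_by_sets(lines, word_list):
    """Index whole-word matches using a set intersection per line."""
    wanted = set(word_list)
    index_dict = {}
    for line_number, line in enumerate(lines, 1):
        # Assumption: strip surrounding punctuation and lowercase each token.
        tokens = {token.strip(string.punctuation).lower() for token in line.split()}
        # Only words present in both sets are recorded for this line.
        for word in wanted & tokens:
            index_dict.setdefault(word, []).append(line_number)
    return index_dict


# Hypothetical sample: "devil" on line 1 must not count as a match for "evil".
text = ["The devil, they say,", "is evil.", "Evil? No raven here"]
result = index_by_sets(text, ["evil", "raven"])
print(result)
```

Unlike the `word in line` substring test, this only matches whole words, which is where the accuracy gain mentioned above would come from.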

You are replacing the old value with this line:

indexDict[word] = [count]

Changing it to

indexDict[word] = indexDict.setdefault(word, []) + [count]

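will give you the answer you want. It takes the current value of `indexDict[word]` and appends the new count to it; if there is no `indexDict[word]` yet, it creates a new empty list and appends the count to that.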
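A closely related variant, offered here only as a sketch, calls `append` on the list that `setdefault` returns, which avoids building a new list on every match. The `hits` data below is hypothetical and stands in for the (word, line number) pairs found while scanning the file.

```python
# Hypothetical (word, line number) pairs in the order they would be found.
hits = [("raven", 44), ("evil", 99), ("raven", 53), ("evil", 106)]

# Variant 1: concatenation, as in the line above (creates a new list each time).
concat_dict = {}
for word, count in hits:
    concat_dict[word] = concat_dict.setdefault(word, []) + [count]

# Variant 2: setdefault returns the stored list, so append mutates it in place.
append_dict = {}
for word, count in hits:
    append_dict.setdefault(word, []).append(count)

print(concat_dict)
print(append_dict)
```

Both loops should end up with the same mapping; the second simply does less copying.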

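You probably want a `defaultdict` that defaults to `[]` for new keys, and then append to it. Of course, your commented-out line just rewrites the key with a one-item list each time.

To everyone who answered, thank you for your input. Much appreciated.

For line numbers, it makes sense to start at 1 rather than at zero. You can use `enumerate(infile, 1)` and then `.append(i)`.

That will append the line, not the line number!

I used numbers just to show the logic, assuming there was already a line index value to append, i.e. `count += 1`.

Lazy people will be lazy. This is exactly what I needed. Thank you very much.

@iKyriaki: the `defaultdict` solution does the same thing with more concise syntax. You chose a dictionary, so I used dictionary methods to help you. I don't understand why some people rewrote your code with `collections`.

What happened to your `count` variable? Shouldn't `indexDict[word].append(count)` be `indexDict[word].append(i)`? And what is `res` for? Also, since you mention `defaultdict`, you could also mention `collections.Counter`, although I don't know the OP's use case well enough to say whether occurrences should be counted.

Bad edit; thanks for pointing out those errors. I considered mentioning `Counter` but decided against it; that API is overkill for this use case.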