Python: How do I read text from a file, identify adjacent duplicate words, and report their positions in the text file?

I am trying to read a quote from a text file and find any duplicated words that appear next to each other. The quote is:

"He that would make his own liberty liberty secure,

must guard even his enemy from oppression;

for for if he violates this duty, he

he establishes a precedent that will reach to himself."
-- Thomas Paine

The output should look like this:

Found word "liberty" on line 1

Found word "for" on line 3

Found word "he" on line 4

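I have written the code that reads the text from the file, but I am having trouble with the code that identifies the duplicates. I tried enumerating every word in the file and checking whether the word at one index equals the word at the next index. However, I get an IndexError because the loop runs past the end of the list. Here is what I have so far: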

import string
file_str = input("Enter file name: ")
input_file = open(file_str, 'r')

word_list = []
duplicates = []

for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)

for idx, word in enumerate(word_list):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)

Any help with my current approach, or suggestions for a different one, would be greatly appreciated.

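This should do the trick. The for loop over the word list now only goes up to the second-to-last element, so idx + 1 never runs past the end. It does not keep track of line numbers, though; for that I would use Phillip Martin's solution.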

import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')

word_list = []
duplicates = []

for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)
#Here is the change I made         >     <
for idx, word in enumerate(word_list[:-1]):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
print(duplicates)

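When you build word_list, you lose the information about which line each word is on.
It may be better to identify the duplicates while you are reading the lines: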
line_number = 1
for line in input_file:
    line_list = line.split()
    previous_word = None
    for word in line_list:
        if word != "--":
            word_list.append(word)
        if word == previous_word:
            duplicates.append([word, line_number])
        previous_word = word
    line_number += 1
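
To make the snippet above easy to run on its own, here is a minimal sketch that wraps the same idea in a function. The name find_adjacent_duplicates and the in-memory sample lines are assumptions for illustration and are not part of the original answer; any iterable of lines, such as an open file, should work the same way.

```python
def find_adjacent_duplicates(lines):
    """Return (word, line_number) pairs for words repeated side by side on one line."""
    duplicates = []
    for line_number, line in enumerate(lines, start=1):
        previous_word = None  # reset per line, so pairs split across lines are not seen
        for word in line.split():
            if word != "--" and word == previous_word:
                duplicates.append((word, line_number))
            previous_word = word
    return duplicates


# Sample lines standing in for the quote file (hypothetical in-memory data).
quote = [
    '"He that would make his own liberty liberty secure,',
    "must guard even his enemy from oppression;",
    "for for if he violates this duty, he",
    'he establishes a precedent that will reach to himself."',
    "-- Thomas Paine",
]

for word, line_number in find_adjacent_duplicates(quote):
    print(f'Found word "{word}" on line {line_number}')
```

With the sample quote this reports liberty on line 1 and for on line 3. The he/he pair straddles lines 3 and 4, so a check that resets on every line does not see it.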

Here is another approach (Python 2, since it uses itertools.izip):
from itertools import tee, izip
from collections import defaultdict

dups = defaultdict(set)
with open('file.txt') as f:
    for no, line in enumerate(f, 1):
        it1, it2 = tee(line.split())
        next(it2, None)
        for word, follower in izip(it1, it2):
            if word != '--' and word == follower:
                dups[no].add(word)

which produces:
>>> dups
defaultdict(<type 'set'>, {1: set(['liberty']), 3: set(['for'])})

(I don't know why you expect to find "he" on line 4; it is certainly not doubled within a line in your sample file.)
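
On Python 3, itertools.izip no longer exists and the built-in zip is lazy, so the same pairing technique can be sketched as follows. The function name adjacent_duplicates_by_line and the sample list are assumptions for illustration; the original answer reads from 'file.txt' instead.

```python
from collections import defaultdict
from itertools import tee


def adjacent_duplicates_by_line(lines):
    """Map line number -> set of words that appear twice in a row on that line."""
    dups = defaultdict(set)
    for no, line in enumerate(lines, 1):
        it1, it2 = tee(line.split())
        next(it2, None)  # advance the second iterator so pairs are (word, follower)
        for word, follower in zip(it1, it2):
            if word != "--" and word == follower:
                dups[no].add(word)
    return dups


# Hypothetical in-memory copy of the quote; an open file object works the same way.
sample = [
    '"He that would make his own liberty liberty secure,',
    "must guard even his enemy from oppression;",
    "for for if he violates this duty, he",
    'he establishes a precedent that will reach to himself."',
    "-- Thomas Paine",
]

dups = adjacent_duplicates_by_line(sample)
print(dict(dups))
```

Because each line is paired only with itself, this variant still finds nothing on line 4.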
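What if two identical words are separated by a line break?
if word_list[idx] == word_list[idx + 1] goes out of bounds when idx is the last index. You have to skip the first step and check the previous element instead of the next one.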
>>> dups[3]
set(['for'])
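
The two comments above (duplicates split by a line break, and comparing against the previous element rather than the next) can be combined in one pass. This is only a sketch under assumptions: the function name find_duplicates_across_lines is made up for illustration, a duplicate is attributed to the line holding its second occurrence, and the "--" attribution marker is treated as a break between words.

```python
def find_duplicates_across_lines(lines):
    """Return (word, line_number) pairs, also catching pairs split by a line break."""
    duplicates = []
    previous_word = None  # deliberately kept across lines
    for line_number, line in enumerate(lines, start=1):
        for word in line.split():
            if word == "--":
                previous_word = None  # assumption: the marker separates, never matches
                continue
            # Comparing with the previous word means no index can run off the end.
            if word == previous_word:
                duplicates.append((word, line_number))
            previous_word = word
    return duplicates


# Hypothetical in-memory copy of the quote from the question.
quote_lines = [
    '"He that would make his own liberty liberty secure,',
    "must guard even his enemy from oppression;",
    "for for if he violates this duty, he",
    'he establishes a precedent that will reach to himself."',
    "-- Thomas Paine",
]

for word, line_number in find_duplicates_across_lines(quote_lines):
    print(f'Found word "{word}" on line {line_number}')
```

On the sample quote this reports liberty on line 1, for on line 3, and he on line 4, which lines up with the output the question asks for.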