Python: finding repeated words in a text file

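I was wondering if you could help me with a Python programming problem. I am currently trying to write a program that reads a text file and outputs "word 1 True" if the word has already appeared earlier in the file, or "word 1 False" if this is the first time the word appears.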

Here is what I have come up with:

fh = open(fname)
lst = list ()
for line in fh:
    words = line.split()
    for word in words:
        if word in words:
            print("word 1 True", word)
        else:
            print("word 1 False", word)
However, it only ever prints "word 1 True".

Please advise.

Thanks.

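A simple (and fast) way to do this is to use a Python dictionary. A dictionary can be thought of as an array whose index key is a string rather than a number.

This leads to a code snippet like: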

found_words = {}    # empty dictionary
with open("words1.txt", "rt") as fh:
    words1 = fh.read().split()  # split on any whitespace; TODO - handle punctuation
for word in words1:
    if word in found_words:
        print(word + " already in file")
    else:
        found_words[word] = True    # could be set to anything

Now, as you process the words, checking whether a word is already in the dictionary tells you whether it has been seen before.
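
The TODO about punctuation in the snippet above could be handled by normalizing each token before the dictionary lookup. The following is only a minimal sketch: the helper names normalize and repeated_words are made up for illustration, and it assumes that case should be ignored and that only leading and trailing punctuation matters.

```python
import string


def normalize(word):
    """Lower-case a token and strip leading/trailing punctuation (hypothetical helper)."""
    return word.strip(string.punctuation).lower()


def repeated_words(text):
    """Return every word that shows up again after its first appearance, in reading order."""
    found_words = {}
    repeats = []
    for raw in text.split():
        word = normalize(raw)
        if not word:  # the token consisted only of punctuation
            continue
        if word in found_words:
            repeats.append(word)
        else:
            found_words[word] = True  # the value is irrelevant, only the key matters
    return repeats


print(repeated_words("The cat saw the dog. The dog ran!"))
```

With this normalization "The", "the" and "dog." / "dog" are treated as the same word; whether that is desirable depends on the task.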

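The following snippet does not read from a file, but it is easy to test and study. The main difference is that you would have to load the file and read it line by line, as in your example.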

example_file = """
This is a text file example

Let's see how many time example is typed.

"""
result = {}
words = example_file.split()
for word in words:
    # if the word is not in the result dictionary, the default value is 0 + 1
    result[word] = result.get(word, 0) + 1
for word, occurrence in result.items():
    print("word:%s; occurrence:%s" % (word, occurrence))
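
The snippet above works on an in-memory string. Reading line by line, as mentioned, might look like the sketch below. The function name count_words is an assumption made for illustration, and "words1.txt" in the comment is only a placeholder file name.

```python
def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    result = {}
    for line in lines:
        for word in line.split():
            # a missing word defaults to 0, so its first occurrence becomes 1
            result[word] = result.get(word, 0) + 1
    return result


# A file object is itself an iterable of lines, so with a real file this would be:
#     with open("words1.txt") as fh:
#         counts = count_words(fh)
# Here a list of lines stands in for the file so the example is self-contained.
sample_lines = [
    "This is a text file example\n",
    "\n",
    "Let's see how many time example is typed.\n",
]
counts = count_words(sample_lines)
print(counts["example"], counts["is"])
```

Passing the lines in as an argument keeps the counting logic testable without touching the file system.
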
Update:

As @khachik suggested, a better solution is to use Counter:

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
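
For the question at hand, the same Counter can be filtered down to the words that occur more than once. This is a small sketch on a made-up sample string rather than on a file:

```python
import re
from collections import Counter

text = "the cat and the hat and the bat"  # made-up sample text
words = re.findall(r"\w+", text.lower())
counts = Counter(words)

# Keep only the words that occur more than once, most frequent first.
duplicates = [(word, n) for word, n in counts.most_common() if n > 1]
print(duplicates)
```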

You may also want to keep track of the earlier locations, for example:

with open(fname) as fh:
    vocab = {}  # maps each word to the (line, position) pairs seen so far
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)
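
If the positions only need to be collected rather than printed as you go, one possible variant uses collections.defaultdict, which removes the if/else branch. The function name word_locations is hypothetical, and a list of strings stands in for the file:

```python
from collections import defaultdict


def word_locations(lines):
    """Map each word to a list of (line_index, word_index) positions."""
    vocab = defaultdict(list)
    for i, line in enumerate(lines):
        for j, word in enumerate(line.split()):
            vocab[word].append((i, j))  # a missing key starts as an empty list
    return vocab


locations = word_locations(["a b a", "b c"])
# Words with more than one recorded position are the repeated ones.
repeated = {word: pos for word, pos in locations.items() if len(pos) > 1}
print(repeated)
```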

Following your approach, you could do something like this:

with open('tyger.txt', 'r') as f:
    lines = (f.read()).split()
    for word in lines:
        if lines.count(word) > 1:
            print(f"{word}: True")
        else:
            print(f"{word}: False")

You can also count every word:

with open('tyger.txt', 'r') as f:
    count = {}
    lines = f.read()
    lines = lines.split()
    for i in lines:
        count[i] = lines.count(i)
    print(count)

You can then use the dictionary like this:

for k in count:
    if count[k] > 1:
        print(f"{k}: True")
    else:
        print(f"{k}: False")
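
You need an additional set to check whether a word has already been seen, adding the word to the set if it has not.

Every word taken from words is necessarily in words, so that test is just an expensive way of writing "if True:".

If you are looking for duplicates, you need a count. Why do you need to iterate over it twice?

Very good feedback! If you want to do it that way, it is better to use collections.Counter.

Thanks @khachik, I didn't know about Counter. Thanks. This is pythonic ;)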
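
The set-based approach suggested in the first comment could be sketched as follows. The function name flag_repeats is made up for illustration, and a list of strings stands in for the open file:

```python
def flag_repeats(lines):
    """Yield (word, seen_before) for every word, in reading order."""
    seen = set()
    for line in lines:
        for word in line.split():
            # test membership first, then record the word for later occurrences
            yield word, word in seen
            seen.add(word)


for word, seen_before in flag_repeats(["the cat", "the dog the"]):
    print(word, seen_before)
```

Unlike the test in the question, which looks each word up in the very list it came from, the lookup here is against the words seen earlier, so a first occurrence yields False.
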
Output of print(count):

{'When': 1, 'the': 2, 'stars': 1, 'threw': 1, 'down': 1, 'their': 2,
 'spears': 1, 'And': 1, "water'd": 1, 'heaven': 1, 'with': 1, 'tears:': 1,
 'Did': 2, 'he': 2, 'smile': 1, 'his': 1, 'work': 1, 'to': 1,
 'see?': 1, 'who': 1, 'made': 1, 'Lamb': 1, 'make': 1, 'thee?': 1}

Output of the True/False loop over count:

When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False