Python: finding word counts in a text file, using a different text file as the "dictionary"

python string text

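I have a vocabulary file containing words that I need to look up in other text documents. I need to find the count of each word, if it occurs at all. For example:

vocab.txt: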

thought
await
thorough
away
red

test.txt:

I thought that if i await thorough enough, my thought would take me away.
Away I thought the thought.

In the end, I should see 4 instances of thought, 1 of await, 2 of away, 1 of thorough, and 0 of red. I tried doing it like this:

for vocabLine in vocabOutFile:
    wordCounter = 0
    print >> sys.stderr, "Vocab word:", vocabLine
    for line in testFile:
        print >> sys.stderr, "Line 1 :", line
        if vocabLine.rstrip('\r\n') in line.rstrip('\r\n'):
            print >> sys.stderr, "Vocab word is in line"
            wordCounter = wordCounter + line.count(vocabLine)
            print >> sys.stderr, "Word counter", wordCounter
    testFile.seek(0, 0)

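I have a strange feeling that the return characters in the vocab file are keeping it from recognizing the words, because while debugging I determined that it correctly counts any word that matches at the end of a string. However, after using rstrip(), the counts are still not correct. Once all of this is done, I have to remove from the list the words that do not occur more than 2 times.

What am I doing wrong?

Thanks.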

Use regex and collections.Counter:

import re
from collections import Counter
from itertools import chain

with open("voc") as v, open("test") as test:
    # Create a set of words from the vocabulary file
    words = set(line.strip().lower() for line in v)

    # Find the words in the test file using a regex
    words_test = [re.findall(r'\w+', line) for line in test]

    # Count the words that also appear in the vocabulary set
    counter = Counter(word.lower() for word in chain(*words_test)
                      if word.lower() in words)
    for word in words:
        print word, counter[word]

Output:
thought 4
away 2
await 1
red 0
thorough 1
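
For readers who find the one-liner above dense, here is a minimal step-by-step sketch of the same idea. It is written for Python 3 (an assumption; the answer itself uses Python 2 print syntax), and the lists `vocab_lines` and `text_lines` are hypothetical in-memory stand-ins for the two files so the example runs on its own.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical stand-ins for the lines of the vocabulary and test files.
vocab_lines = ["thought\n", "await\n", "thorough\n", "away\n", "red\n"]
text_lines = [
    "I thought that if i await thorough enough, my thought would take me away.\n",
    "Away I thought the thought.\n",
]

# Step 1: strip the newline from each vocabulary line and lowercase it.
words = {line.strip().lower() for line in vocab_lines}

# Step 2: re.findall returns one list of whole words per line (a list of lists).
words_per_line = [re.findall(r"\w+", line) for line in text_lines]

# Step 3: flatten the list of lists into a single stream of words.
all_words = chain.from_iterable(words_per_line)

# Step 4: lowercase each word and count only those in the vocabulary set.
counter = Counter(w.lower() for w in all_words if w.lower() in words)

# A Counter returns 0 for keys it never saw, so "red" is reported as 0.
for word in sorted(words):
    print(word, counter[word])
```

Because the text is split into whole words before counting, a vocabulary word is never matched inside a longer word, which plain substring counting cannot guarantee.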

Building a dictionary for your vocab words is a good idea:
import re

vocab_counter = {vocabLine.strip().lower(): 0 for vocabLine in vocabOutFile}

Then scan testFile just once (more efficient), incrementing the count for each word:
for line in testFile:
    for word in re.findall(r'\w+', line.lower()):
        if word in vocab_counter:
            vocab_counter[word] += 1
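
The question also mentions dropping words that do not occur more than 2 times. A minimal sketch of the whole flow, assuming Python 3 and using `io.StringIO` objects (the hypothetical names `vocab_file` and `test_file`) in place of the real files, might look like this:

```python
import io
import re

# Hypothetical in-memory stand-ins for vocabOutFile and testFile.
vocab_file = io.StringIO("thought\nawait\nthorough\naway\nred\n")
test_file = io.StringIO(
    "I thought that if i await thorough enough, my thought would take me away.\n"
    "Away I thought the thought.\n"
)

# Start every vocabulary word at zero; skip blank lines.
vocab_counter = {line.strip().lower(): 0 for line in vocab_file if line.strip()}

# Single pass over the text: lowercase, split into whole words, count matches.
for line in test_file:
    for word in re.findall(r"\w+", line.lower()):
        if word in vocab_counter:
            vocab_counter[word] += 1

# Keep only the words that occur more than 2 times.
frequent = {word: n for word, n in vocab_counter.items() if n > 2}
print(frequent)
```

With the sample data from the question, only thought (4 occurrences) survives the filter.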

Comments:

Is testFile a file object?

Yes, both testFile and vocabOutFile are file objects.

Should "Away" be counted? It looks like it should. You should normalize the case (e.g. by calling .lower() on the strings).

So what output do you get?

After the first pass, testFile will be at the end, so the loop is skipped on subsequent passes. You need to reopen the file or seek back to the start.

This is a good answer, but there is a lot of more advanced Python here (list comprehensions, itertools.chain, generators, *args); it might be better to explain a bit more how each line of your code works.

Hey, I don't know if you are still watching this, but the interpreter reports a syntax problem with this segment: {vocabLine.strip().lower(): 0 for vocabLine in vocabOutFile}. It doesn't like the for statement for some reason.

@FeralShadow, that is a dict comprehension. It only works in Python 2.7 or later. For Python 2.6 you can use dict((vocabLine.strip().lower(), 0) for vocabLine in vocabOutFile).

Aha, okay! Strange that I don't have 2.7. Thanks.

So, one more problem: this isn't incrementing the count for any word. It behaves as if the word is never found in the dict. It reads each line correctly and looks at each word correctly in the second for loop, but the if statement is never true.

Ah! Sorry, I missed some of the code you provided. It works great! Thank you, gnibbler!
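
The pitfalls raised in the question and comments (unstripped newlines, case sensitivity, substring matches, and an exhausted file object) can be reproduced in a few lines. This is only an illustrative sketch, assuming Python 3; the sample strings and variable names are made up for the demonstration and are not taken from the original files.

```python
import io

sample = "I thought the thought\n"

# str.count matches raw substrings, so an unstripped vocabulary line
# ("thought\n") only matches where the word ends the line.
unstripped_hits = sample.count("thought\n")
stripped_hits = sample.count("thought\n".rstrip("\r\n"))
print(unstripped_hits, stripped_hits)

# Matching is case-sensitive unless both sides are lowercased.
mixed = "Away we went, far away\n"
case_sensitive_hits = mixed.count("away")
lowered_hits = mixed.lower().count("away")
print(case_sensitive_hits, lowered_hits)

# Substring counting also finds the word inside a longer word.
inside_hits = "afterthought".count("thought")
print(inside_hits)

# A file object is an iterator: a second loop sees nothing until seek(0).
f = io.StringIO("a\nb\n")
first_pass = list(f)
second_pass = list(f)
f.seek(0)
third_pass = list(f)
print(len(first_pass), len(second_pass), len(third_pass))
```

In the code from the question, rstrip() is applied in the `in` test but not in the `line.count(vocabLine)` call, which is consistent with only end-of-line matches being counted.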