Python 2.7: counting strings in a file, some single words, some complete phrases


I want to count the occurrences of certain words and names in a file. The code below wrongly counts "fish and chips" as an instance of "fish" and an instance of "chips", rather than an instance of "fish and chips".

ngh.txt = 'test file with words fish, steak fish chips fish and chips'

import re
from collections import Counter
wanted = '''
"fish and chips"
fish
chips
steak
'''
cnt = Counter()
words = re.findall('\w+', open('ngh.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print cnt
Output:

Counter({'fish': 3, 'chips': 2, 'and': 1, 'steak': 1})
What I want is:

Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})
(Ideally, I would be able to get output like:

fish: 2
fish and chips: 1
chips: 1
steak: 1

)

So this solution works for your test data (plus some terms added to the test data, just to be more thorough), although it could probably be improved.

The key to it is finding occurrences of "and" in the word list, then replacing "and" and its neighbours with a compound word (the neighbouring words joined with "and"), and appending that to the list along with a copy of "and".

I also converted the "wanted" string into a list, so that the "fish and chips" string is handled as a distinct item.

import re
from collections import Counter

# changed 'wanted' string to a list
wanted = ['fish and chips','fish','chips','steak', 'and']

cnt = Counter()

words = re.findall('\w+', open('ngh.txt').read().lower())

for word in words:

    # look for 'and', replace it and neighbours with 'comp_word'
    # slice, concatenate, and append to make new words list

    if word == 'and':
        and_pos = words.index('and')
        comp_word = str(words[and_pos-1]) + ' and '  +str(words[and_pos+1])
        words = words[:and_pos-1] + words[and_pos+2:]
        words.append(comp_word)
        words.append('and')

for word in words:
    if word in wanted:
        cnt[word] += 1

print cnt
The output for your text would be:

Counter({'fish':2, 'and':1, 'steak':1, 'chips':1, 'fish and chips':1})

As the comments above pointed out, it isn't clear why you want/expect fish to be 2, chips to be 2, and fish and chips to be 1 in your ideal output. I'm assuming that's a typo, since the output above has 'chips': 1.
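
As an aside, if the key: value layout from the question is preferred over the Counter repr, a minimal sketch for printing it (assuming cnt is the Counter built above):

for term, count in cnt.most_common():
    print '%s: %d' % (term, count)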

I would suggest two algorithms that will work for any patterns and any file. The first algorithm has a run time proportional to (number of characters in the file) * (number of patterns).

1> For each pattern, search all the patterns and create a list of super-patterns. This can be done by matching one pattern, such as 'cat', against all of the patterns to be searched:

patterns = ['cat', 'cat and dogs', 'cat and fish']
superpattern['cat']  = ['cat and dogs', 'cat and fish']
2> Search for 'cat' in the file; suppose the result is cat_count.
3> Now search the file for every super-pattern of 'cat' and get their counts:

for sp in superpattern['cat']:
    sp_count = match sp in file
    cat_count = cat_count - sp_count
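
Putting those steps together, a minimal sketch of the brute-force approach in Python, applied to the question's data (the name brute_force_counts and the use of re.findall with word boundaries are my own choices, not from the answer above):

import re

def brute_force_counts(text, patterns):
    # Step 1: for each pattern, build its list of super-patterns, i.e. the
    # longer patterns that contain it as a whole word.
    superpattern = {}
    for p in patterns:
        superpattern[p] = [q for q in patterns
                           if q != p and re.search(r'\b' + re.escape(p) + r'\b', q)]

    # Step 2: count each pattern on its own, with word boundaries.
    raw = {p: len(re.findall(r'\b' + re.escape(p) + r'\b', text)) for p in patterns}

    # Step 3: subtract the counts of each pattern's super-patterns, so that a
    # hit for 'fish and chips' is not also counted as 'fish' and as 'chips'.
    counts = {}
    for p in patterns:
        counts[p] = raw[p] - sum(raw[q] for q in superpattern[p])
    return counts

text = open('ngh.txt').read().lower()
wanted = ['fish and chips', 'fish', 'chips', 'steak']
print brute_force_counts(text, wanted)
# counts: fish 2, chips 1, steak 1, fish and chips 1 (dict order may vary)

Note that subtracting raw super-pattern counts is only right when the wanted items do not nest more than one level deep; the last answer below handles deeper nesting (e.g. 'fish', 'fish fish', 'fish fish fish') by subtracting already-adjusted counts.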
This is a general solution, i.e. brute force. If we arrange the patterns in a trie, we should be able to get a linear-time solution: Root --> f --> i --> s --> h --> a, and so on. Now, when you are at the 'h' of 'fish' and you do not get an 'a', increment the count of 'fish' and go back to the root. If you do get an 'a', continue. Any time you get something unexpected, increment the count of the most recently found pattern and move to the root or to some other node (the node for the longest prefix that is also a suffix of what has been matched so far). This is the Aho-Corasick algorithm; you can look it up on Wikipedia or at:


This solution is linear in the number of characters in the file.
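
A full Aho-Corasick automaton is more code than fits here, but the longest-match behaviour it provides can be approximated with a single regular expression whose alternatives are the wanted terms sorted longest first, so that 'fish and chips' wins over 'fish' at the same position. This is only a rough sketch of the idea, not the linear-time automaton itself:

import re
from collections import Counter

text = open('ngh.txt').read().lower()
wanted = ['fish and chips', 'fish', 'chips', 'steak']

# Longer terms first, so that at each position the alternation tries (and
# therefore prefers) the longest wanted term, mimicking longest-match.
alternation = '|'.join(re.escape(w) for w in sorted(wanted, key=len, reverse=True))
pattern = r'\b(?:' + alternation + r')\b'

print Counter(re.findall(pattern, text))
# e.g. Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})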

Definitions

Wanted item: a string to search for in the text.

To count wanted items without re-counting them inside longer wanted items, first count the number of times each item occurs in the string. Next, go through the wanted items from longest to shortest, and whenever a smaller wanted item occurs inside a longer wanted item, subtract the number of results for the longer item from the shorter item. For example, suppose your wanted items are 'a', 'a b' and 'a b c', and your text is 'a/a/a b/a b c'. Searching for each of them individually produces: {'a': 4, 'a b': 2, 'a b c': 1}. The desired result is: {'a b c': 1, 'a b': ('a b') - ('a b c') = 2 - 1 = 1, 'a': ('a') - ('a b') - ('a b c') = 4 - 1 - 1 = 2}.
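
A quick worked check of that example in code; note that the value subtracted for 'a b' when adjusting 'a' is its already-adjusted count, which is what makes the arithmetic above come out right:

raw = {'a': 4, 'a b': 2, 'a b c': 1}   # counts from searching each item individually
adj = {}
adj['a b c'] = raw['a b c']                          # 1
adj['a b'] = raw['a b'] - adj['a b c']               # 2 - 1 = 1
adj['a'] = raw['a'] - adj['a b'] - adj['a b c']      # 4 - 1 - 1 = 2
print adj   # {'a': 2, 'a b': 1, 'a b c': 1}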

import re

def get_word_counts(text, wanted):
    counts = {}; # The number of times a wanted item was read

    # Dictionary mapping word lengths onto wanted items
    #  (in the form of a dictionary where keys are wanted items)
    lengths = {}; 

    # Find the number of times each wanted item occurs
    for item in wanted:
        matches = re.findall('\\b' + item + '\\b', text);

        counts[item] = len(matches)

        l = len(item) # Length of wanted item

        # No wanted item of the same length has been encountered
        if (l not in lengths):
            # Create new dictionary of items of the given length
            lengths[l] = {}

        # Add wanted item to dictionary of items with the given length
        lengths[l][item] = 1

    # Get and sort lengths of wanted items from largest to smallest
    keys = lengths.keys()
    keys.sort(reverse=True)

    # Remove overlapping wanted items from the counts working from
    #  largest strings to smallest strings
    for i in range(1,len(keys)):
        for j in range(0,i):
            for i_item in lengths[keys[i]]:
                for j_item in lengths[keys[j]]:
                    #print str(i)+','+str(j)+': '+i_item+' , '+j_item
                    matches = re.findall('\\b' + i_item + '\\b', j_item);

                    counts[i_item] -= len(matches) * counts[j_item]

    return counts
The following code contains test cases:

tests = [
    {
        'text': 'test file with words fish, steak fish chips fish and '+
            'chips and fries',
        'wanted': ["fish and chips","fish","chips","steak"]
    },
    {
        'text': 'fish, fish and chips, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'My fish and chips and burgers. My fish and chips and '+
            'burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish","fish fish fish"]
    }
]

for i in range(0,len(tests)):
    test = tests[i]['text']
    print test
    print get_word_counts(test, tests[i]['wanted'])
    print ''
The results are:

test file with words fish, steak fish chips fish and chips and fries
{'fish and chips': 1, 'steak': 1, 'chips': 1, 'fish': 2}

fish, fish and chips, fish and chips and burgers
{'fish and chips': 1, 'fish and chips and burgers': 1, 'fish': 1}

fish, fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 1, 'fish': 1}

My fish and chips and burgers. My fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 2, 'fish': 0}

fish fish fish
{'fish fish': 1, 'fish': 1}

fish fish fish
{'fish fish fish': 1, 'fish fish': 0, 'fish': 0}

What you expect is really confusing. If you want to count all occurrences of fish, it should be 3. If you want to exclude the ones inside other wanted items, it should be 2. Then again, if you exclude wanted items that occur inside other wanted items, fish should be 2 and chips should be 1, yet you have fish at 2 and chips at 2.
I'm sorry, Dave, I've edited the question. Thanks, and apologies for the typo!
No worries, it all got sorted out in the end. Hope this answer works for you.
I'm not sure I understand. Do you mean I have to count the word frequencies manually before running the program?
Yes, but you can do that using the same pattern matching. However, I can't give a general formula for how to use it to get the result.
Brilliant. Thank you very much, Dave.