Pythonic: collecting arbitrary strings - indexer (Python 2.x)

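First off, the code below works as is. I'm more of a Ruby programmer, so I'm still feeling my way around Python, and I'm sure there must be a DRYer way to do what I'm doing below.

I'm building an indexer that creates a dictionary of the terms that occur in a document together with their counts, and then prints the terms with their counts. Right now it supports phrases of up to four words. Is there a better way to abstract this logic so that I can do the same thing for phrases of arbitrary length without adding more and more conditionals?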

import sys
file=open(sys.argv[1],"r")
wordcount = {}
last_word = ""
last_last_word = ""
last_last_last_word = ""

for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

    if last_last_last_word != "":
        if "{} {} {} {}".format(last_last_last_word,last_last_word,last_word,word) not in wordcount:
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] = 1
        else: 
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] += 1
    last_last_last_word = last_last_word

    if last_last_word != "":
        if last_last_word + " " + last_word + " " + word not in wordcount:
            wordcount[last_last_word + " " + last_word + " " + word ] = 1
        else: 
            wordcount[last_last_word + " " + last_word + " " + word ] += 1
    last_last_word = last_word

    if last_word != "":
        if last_word + " " + word not in wordcount:
            wordcount[last_word + " " + word] = 1
        else: 
            wordcount[last_word + " " + word] += 1
    last_word = word

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

I've included a fuller sample of input and output. I apologize for the length, but code of this kind tends to produce large output.

This input:

this is a sample input file an input file will always be all lower case with no punctuation

produces this output:

file 2
input 2
input file 2
an input file 1
all 1
lower case 1
be 1
is 1
file will always 1
an 1
sample 1
case 1
always be all lower 1
this is a 1
will always be 1
sample input file 1
will always 1
is a sample 1
all lower 1
lower case with no 1
no 1
with 1
with no 1
file will always be 1
with no punctuation 1
lower 1
be all lower case 1
no punctuation 1
an input file will 1
input file an 1
file an 1
input file an input 1
always be 1
file an input file 1
be all 1
is a 1
input file will 1
file will 1
an input 1
input file will always 1
will always be all 1
always be all 1
lower case with 1
a sample 1
a sample input file 1
a sample input 1
is a sample input 1
be all lower 1
a 1
sample input file an 1
sample input 1
case with no punctuation 1
all lower case with 1
this 1
always 1
file an input 1
case with 1
case with no 1
will 1
all lower case 1
punctuation 1
this is 1
this is a sample 1

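Note that every single word, every pair of words, every triple, and every group of four words has been counted. I'd like to DRY up this code so that it can return counts for word groups of arbitrary size.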

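Here is a quick refactor of your code; defaultdict is your friend.

It takes the number of words per phrase that you want as the second command-line argument.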

import sys
from collections import defaultdict

max_len = int(sys.argv[2])  # longest phrase length to count

wordcount = defaultdict(int)
# Sliding window holding the previous max_len - 1 words; "" marks an empty slot.
wordlist = ["" for i in range(max_len - 1)]

def check(wordcount, wordlist, word):
    wordlist.append(word)
    for i, w in enumerate(wordlist):
        if w != "":
            # Join with single spaces so the keys carry no trailing space.
            current = " ".join(wordlist[i:])
            wordcount[current] += 1

    return wordlist[1:]

# The context manager closes the file once it has been read.
with open(sys.argv[1], "r") as f:
    for word in f.read().split():
        wordlist = check(wordcount, wordlist, word)

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

Update: made it lazier

from collections import Counter
import itertools
import operator as op


def count_phrases(words, phrase_len):
    return reduce(op.add, 
    (Counter(tuple(words[i:i+l]) for i in xrange(len(words)-l+1)) for l in phrase_len))

For example:

words = "a b c a a".split()
for phrase, count in count_phrases(words, [1, 2]).iteritems():
    print " ".join(phrase), count

Output:

b c 1
a 3
c 1
b 1
c a 1
a a 1
a b 1
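
Note that the snippet above is Python 2 only: reduce is a builtin there, and xrange and iteritems do not exist in Python 3. The following is a minimal sketch of the same idea for Python 3, not the answerer's code. The name count_ngrams is made up for illustration, and Counter.update stands in for the reduce over separate Counter objects.

```python
from collections import Counter


def count_ngrams(words, lengths):
    """Count every run of consecutive words whose length is in `lengths`.

    `count_ngrams` is a hypothetical name used for illustration only.
    """
    counts = Counter()
    for n in lengths:
        # words[i:i + n] is the n-word window starting at position i.
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


words = "a b c a a".split()
for phrase, count in sorted(count_ngrams(words, [1, 2]).items()):
    print(" ".join(phrase), count)
```

Keeping the phrases as tuples until they are printed follows the original answer; joining them into strings earlier would work just as well for small inputs.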

Try this:

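from collections import Counter

def parser(data,size):
    chunked = data.split()
    phrases = []
    # Note: this range stops one window short, so the phrase that ends on
    # the last word of the line is never produced.
    for i in xrange(len(chunked)-size):
        phrase=' '.join(chunked[i:size+i])
        phrases.append(phrase)
    return phrases

def parse_file(fname,size):
    result = []
    with open(fname,'r') as f:
        for data in f.readlines():
            # Note: xrange(1,size) yields lengths 1..size-1, so size=4
            # counts phrases of up to three words.
            for i in xrange(1,size):
                result+=parser(data.strip(),i)

    return Counter(result)


result= parse_file('file.txt',4)
print sorted(result.items(),key=lambda x:x[1],reverse=True)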

[('file', 2),
 ('input', 2),
 ('input file', 2),
 ('an input file', 1),
 ('all', 1),
 ('always be all', 1),
 ('is', 1),
 ('an', 1),
 ('sample', 1),
 ('this is a', 1),
 ('will always be', 1),
 ('sample input file', 1),
 ('will always', 1),
 ('is a sample', 1),
 ('all lower', 1),
 ('no', 1),
 ('with no', 1),
 ('lower case', 1),
 ('case', 1),
 ('input file will', 1),
 ('case with no', 1),
 ('input file an', 1),
 ('file an', 1),
 ('be', 1),
 ('always be', 1),
 ('be all lower', 1),
 ('be all', 1),
 ('lower', 1),
 ('is a', 1),
 ('an input', 1),
 ('a sample input', 1),
 ('lower case with', 1),
 ('a sample', 1),
 ('file will', 1),
 ('with', 1),
 ('a', 1),
 ('file will always', 1),
 ('sample input', 1),
 ('this', 1),
 ('always', 1),
 ('file an input', 1),
 ('case with', 1),
 ('will', 1),
 ('all lower case', 1),
 ('this is', 1)]
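
The listing above stops at three-word phrases and never includes the last word of the line ('punctuation' is missing), which suggests that both ranges end one step early. Below is a hedged sketch of a variant that keeps the last window and covers the full requested size. The names ngrams and count_file_phrases are illustrative, the code targets Python 3, and like the original it does not join phrases across line breaks.

```python
from collections import Counter


def ngrams(line, size):
    """Return every phrase of exactly `size` consecutive words in `line`."""
    chunked = line.split()
    # + 1 so the window that ends on the last word is included.
    return [" ".join(chunked[i:i + size]) for i in range(len(chunked) - size + 1)]


def count_file_phrases(fname, max_size):
    """Count phrases of 1..max_size words, line by line (hypothetical helper)."""
    counts = Counter()
    with open(fname) as f:
        for line in f:
            # range(1, max_size + 1) so that max_size itself is covered.
            for size in range(1, max_size + 1):
                counts.update(ngrams(line, size))
    return counts
```

Updating one Counter per line avoids building the intermediate result list, which may matter if the input file is large.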

Here you go. I think this is what you have been looking for:

string="this is a sample input file an input file will always be all lower case with no punctuation"

def words(count):
    return [" ".join(string.split()[a:b]) for a in range(len(string.split())) for b in range(a+count+1) if len(string.split()[a:b]) == count]

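It works by slicing the input text and returns a list of the phrases that have the requested length.

Call the function with the length of the sequences you are looking for: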

lst = words(3)

Then loop over the result to get the counts:

for word in set(lst):
    print word, lst.count(word)

an input file 1
file will always 1
is a sample 1
be all lower 1
file an input 1
with no punctuation 1
input file will 1
lower case with 1
this is a 1
always be all 1
will always be 1
sample input file 1
a sample input 1
all lower case 1
case with no 1
input file an 1

Yes, as the comments point out, this is an inefficient approach, so I apologize for that.
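
To make the inefficiency concrete: lst.count(word) rescans the whole list once per distinct phrase, so the loop above is roughly quadratic in the number of phrases, and words() re-splits the string on every slice. A single-pass alternative could look like the sketch below. It assumes the same sample string, uses collections.Counter, and introduces the made-up name phrases_of_length; it is written for Python 3.

```python
from collections import Counter

text = "this is a sample input file an input file will always be all lower case with no punctuation"


def phrases_of_length(text, count):
    """Return all phrases of exactly `count` consecutive words."""
    tokens = text.split()  # split once instead of on every slice
    return [" ".join(tokens[i:i + count]) for i in range(len(tokens) - count + 1)]


# Counter tallies every phrase in one pass over the list.
tally = Counter(phrases_of_length(text, 3))
for phrase, n in tally.most_common():
    print(phrase, n)
```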

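You stated that you want to extract phrases of arbitrary length, so in case my first assumption was wrong, here is another solution that gives you the counts of the phrase combinations without using the .count() method.

Be aware, though, that with this approach the entire text also counts as one phrase, so make sure you really do pin down the phrase lengths you want.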

words_list = string.split()
words_dict = {}

# + 1 so that slices ending on the last word are included.
for a in range(len(words_list) + 1):
    for b in range(a):
        phrase = " ".join(words_list[b:a])
        if phrase in words_dict:
            words_dict[phrase] += 1
        else:
            words_dict[phrase] = 1

for i in words_dict:
    print i, words_dict[i]

This gives you the phrases of every length.
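
If only phrases up to some maximum length are wanted, the inner loop can be bounded instead of enumerating every slice. Here is a sketch under that assumption. The names count_up_to and max_len are introduced for illustration and are not from the answer, and the code is written for Python 3.

```python
def count_up_to(words_list, max_len):
    """Count all phrases of 1..max_len consecutive words."""
    words_dict = {}
    for end in range(1, len(words_list) + 1):
        # Only start positions within max_len words of `end` are considered.
        for start in range(max(0, end - max_len), end):
            phrase = " ".join(words_list[start:end])
            words_dict[phrase] = words_dict.get(phrase, 0) + 1
    return words_dict
```

With this bound the work grows with len(words_list) * max_len rather than with the square of the text length.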

A humble contribution:

import sys
file=open(sys.argv[1],"r")
wordcount = {}
nb_words = 4
last_words = []

for word in file.read().split():
    last_words = [word] + last_words 
    if len (last_words) > nb_words:
        last_words.pop()
    for i in range(len(last_words)-1,-1,-1):
        if last_words[i] != "":
            # last_words is stored newest-first, so reverse to restore reading order.
            key = ' '.join(reversed(last_words[:i+1]))
            if key not in wordcount:
                wordcount[key] = 1
            else: 
                wordcount[key] += 1

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

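I wrote a loop to replace the separate variables, so now there is a single parameter, and it can go beyond 4 words.

Edit: after a few bug fixes, I'm now confident that it produces the same output.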
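
The manual push and pop on last_words is the behaviour that collections.deque offers through its maxlen argument. Below is a minimal sketch of that variant, assuming the words arrive as any iterable. The name count_phrases_windowed is illustrative and the code targets Python 3.

```python
from collections import Counter, deque


def count_phrases_windowed(words, nb_words):
    """Count phrases of 1..nb_words consecutive words in a single pass."""
    counts = Counter()
    window = deque(maxlen=nb_words)  # the oldest word drops off automatically
    for word in words:
        window.append(word)
        recent = list(window)
        # Every suffix of the window is a phrase that ends at `word`.
        for start in range(len(recent)):
            counts[" ".join(recent[start:])] += 1
    return counts
```

Because the window keeps the words oldest-first, the keys come out in reading order without any reversal.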

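If you are concerned about a large file (possibly one that doesn't even have line endings to allow line-by-line iteration), you can memory-map it (which keeps memory usage low), use a regular expression to isolate all the lowercase words, build a sliding window of N words, and then update a Counter accordingly, for example: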

import re
import mmap
from itertools import islice, izip, tee
from collections import Counter
from pprint import pprint

def word_grouper(filename, size):
    counts = Counter()
    with open(filename) as fin:
        mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
        words = (m.group() for m in re.finditer('[a-z]+', mm))
        sliding = [islice(w, n, None) for n, w in enumerate(tee(words, size+1))]
        for slide in izip(*sliding):
            counts.update(slide[:n] for n in range(1, len(slide)))

    return counts

counts = word_grouper('input filename', 4)
# do appropriate formatting instead of just `pprint`ing
pprint(counts.most_common())

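Comments on the question and answers:

So what do you mean by four-word phrases? Can you give us an example of the input and the expected output?

I think he means phrases made of four words.

@Pablo: So how do you grab these four-word phrases?

To OP: do you mean splitting the list file.read().split() into chunks?

Yes, @pablo. :) I mean phrases of four words... four words that appear in that order one or more times in the file. Ultimately, I'd like to be able to produce output for phrases of any number of words.

@DavidHoelzer, can you add a small sample input and the expected output to explain?

This still doesn't DRY it up. Also, I think you actually broke the four-word phrase logic.

@DavidHoelzer Take a look now?

@DavidHoelzer This also matches the solution code you provided.

This looks good so far. Thanks! I'll evaluate the code and see whether anything better than this turns up.

My upvote at least gets you back to "zero". :)

You opened a file without a context manager and forgot to close it.

The files this will run against are going to be massive. Reading the entire file into memory first doesn't look optimal.

You can also use yield. I can update the code if that's the only problem.

@AliNikneshan You mean f.readlines(): as written, the loop iterates over each character of the first line. I didn't downvote, but the logic is broken.

The spaces are required, and your solution no longer preserves them.

@UtsavShah Why are the spaces needed?

His original code has spaces in the keys, so they look like "word1 word2 word3"; that's why. I'm guessing this is a natural language processing task.

@UtsavShah Then only the word order matters, so the spaces are not important.

@DavidHoelzer And by the way, storing the split words in tuples reduces memory consumption, because Python can then reuse the referenced objects instead of creating new copies in memory, so putting the spaces back in at the last step gives better performance.

This doesn't match the provided output. It's also very inefficient; calling list.count is a very poor way to get the counts.

Let me try a different approach; hold on a moment.

@PadraicCunningham So is this approach better than .count()?

@Rockybilly, a membership test against words_dict.keys() is an O(n) operation in Python 2 and an unnecessary method call in Python 3; you can test against words_dict directly, or better, use a Counter dict, and split the string only once.

Sample output (where the input file contains the sample string):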

[(('file',), 2),
 (('input', 'file'), 2),
 (('input',), 2),
 (('a', 'sample', 'input'), 1),
 (('file', 'will', 'always', 'be'), 1),
 (('sample', 'input', 'file', 'an'), 1),
 (('this', 'is', 'a', 'sample'), 1),
 (('this', 'is'), 1),
 (('will',), 1),
 (('lower', 'case', 'with'), 1),
 (('an', 'input', 'file', 'will'), 1),
 (('sample', 'input'), 1),
 (('is', 'a'), 1),
 (('all', 'lower', 'case', 'with'), 1),
 (('input', 'file', 'will'), 1),
 (('an',), 1),
 (('always', 'be'), 1),
 (('lower', 'case', 'with', 'no'), 1),
 (('an', 'input'), 1),
 (('be', 'all', 'lower'), 1),
 (('this',), 1),
 (('be', 'all', 'lower', 'case'), 1),
 (('this', 'is', 'a'), 1),
 (('sample',), 1),
 (('sample', 'input', 'file'), 1),
 (('will', 'always', 'be', 'all'), 1),
 (('a',), 1),
 (('a', 'sample'), 1),
 (('is', 'a', 'sample'), 1),
 (('will', 'always'), 1),
 (('lower',), 1),
 (('lower', 'case'), 1),
 (('file', 'an'), 1),
 (('file', 'an', 'input'), 1),
 (('file', 'will'), 1),
 (('is',), 1),
 (('all', 'lower'), 1),
 (('input', 'file', 'an', 'input'), 1),
 (('always', 'be', 'all', 'lower'), 1),
 (('an', 'input', 'file'), 1),
 (('input', 'file', 'an'), 1),
 (('be', 'all'), 1),
 (('input', 'file', 'will', 'always'), 1),
 (('be',), 1),
 (('all',), 1),
 (('always', 'be', 'all'), 1),
 (('is', 'a', 'sample', 'input'), 1),
 (('always',), 1),
 (('all', 'lower', 'case'), 1),
 (('file', 'an', 'input', 'file'), 1),
 (('file', 'will', 'always'), 1),
 (('a', 'sample', 'input', 'file'), 1),
 (('will', 'always', 'be'), 1)]
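
Note that the sample output above has no entries that start at one of the last four words (for example, ('punctuation',) is absent), because izip stops as soon as the shortest shifted iterator runs out. One way to keep those trailing phrases is to pad with zip_longest and drop the padding, as in the sketch below. It is written for Python 3, takes an in-memory string rather than an mmap to stay self-contained, and the names word_windows and count_prefixes are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import islice, tee, zip_longest


def word_windows(words, size):
    """Yield tuples of up to `size` consecutive words, one per start position."""
    shifted = [islice(it, n, None) for n, it in enumerate(tee(words, size))]
    for window in zip_longest(*shifted):
        # zip_longest pads the tail with None; drop the padding.
        yield tuple(w for w in window if w is not None)


def count_prefixes(text, size):
    """Count phrases of 1..size lowercase words, including the trailing ones."""
    counts = Counter()
    words = (m.group() for m in re.finditer(r"[a-z]+", text))
    for window in word_windows(words, size):
        counts.update(window[:n] for n in range(1, len(window) + 1))
    return counts
```

Here tee is given size iterators rather than size + 1, since each window only needs to hold the longest phrase that is counted.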