Python: populating a shelve object/dictionary with multiple keys


I have a list of 4-grams that I want to use to populate a dictionary/shelve object:

['I','go','to','work']
['I','go','there','often']
['it','is','nice','being']
['I','live','in','NY']
['I','go','to','work']
so that we end up with something like:

four_grams['I']['go']['to']['work']=1

Each newly encountered 4-gram is stored under its four keys with a value of 1, and if it is encountered again, its value is incremented.

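You can simply write a helper function that inserts the elements one at a time into a nested dictionary, checking at each level whether the required sub-dictionary already exists: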

dict = {}
def insert(fourgram):
    d = dict    # reference
    for el in fourgram[0:-1]:       # elements 1-3 if fourgram has 4 elements
        if el not in d: d[el] = {}  # create new, empty dict
        d = d[el]                   # move into next level dict

    if fourgram[-1] in d: d[fourgram[-1]] += 1  # increment existing, or...
    else: d[fourgram[-1]] = 1                   # ...create as 1 first time

You can populate it with your dataset like so:

insert(['I','go','to','work'])
insert(['I','go','there','often'])
insert(['it','is','nice','being'])
insert(['I','live','in','NY'])
insert(['I','go','to','work'])

Afterwards, you can index into dict as needed:

print(dict['I']['go']['to']['work'])      # prints 2
print(dict['I']['go']['there']['often'])  # prints 1
print(dict['it']['is']['nice']['being'])  # prints 1
print(dict['I']['live']['in']['NY'])      # prints 1
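
Indexing with four keys raises a KeyError for any 4-gram that was never inserted. As a minimal sketch of the same nested-dictionary idea, the variant below passes the store in as a parameter and adds a lookup that returns 0 for unseen 4-grams. The names four_grams, store and lookup are illustrative and not part of the answer above, and lookup assumes it is always given a full 4-gram.

```python
def insert(store, gram):
    """Walk down the nested dicts, creating levels as needed, then count."""
    node = store
    for word in gram[:-1]:
        node = node.setdefault(word, {})        # descend, creating the level if missing
    node[gram[-1]] = node.get(gram[-1], 0) + 1  # first sighting becomes 1


def lookup(store, gram):
    """Return the count for a full 4-gram, or 0 if it was never inserted."""
    node = store
    for word in gram:
        if not isinstance(node, dict) or word not in node:
            return 0
        node = node[word]
    return node


four_grams = {}
for gram in [
    ['I', 'go', 'to', 'work'],
    ['I', 'go', 'there', 'often'],
    ['it', 'is', 'nice', 'being'],
    ['I', 'live', 'in', 'NY'],
    ['I', 'go', 'to', 'work'],
]:
    insert(four_grams, gram)

print(lookup(four_grams, ['I', 'go', 'to', 'work']))    # prints 2
print(lookup(four_grams, ['I', 'go', 'to', 'school']))  # prints 0
```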

You can do it like this:

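import shelve

from collections import defaultdict
from functools import reduce  # reduce is not a builtin in Python 3

db = shelve.open('/tmp/db')

grams = [
    ['I','go','to','work'],
    ['I','go','there','often'],
    ['it','is','nice','being'],
    ['I','live','in','NY'],
    ['I','go','to','work'],
]

def f(path, word):
    # descend one level, creating the sub-dictionary if it is missing
    if word not in path:
        path[word] = defaultdict(int)
    return path[word]

for gram in grams:
    # fetch the whole subtree stored under the first word, or start a new one
    path = db.get(gram[0], defaultdict(int))

    # walk the middle words, then increment the count under the last word
    reduce(f, gram[1:-1], path)[gram[-1]] += 1

    # assign the modified subtree back so the shelf actually stores it
    db[gram[0]] = path

print(dict(db))

db.close()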
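
The assignment db[gram[0]] = path at the end of the loop matters: with the default writeback=False, each read from a shelf returns a freshly unpickled copy, so mutating a nested value in place is not stored. The following sketch illustrates that behavior with a throwaway shelf; the temporary path and the variable names in_place and reassigned are assumptions made for the demonstration only.

```python
import os
import shelve
import tempfile

# Throwaway location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo_shelf")

with shelve.open(path) as db:      # writeback defaults to False
    db['I'] = {'go': 1}

    db['I']['go'] += 1             # modifies a temporary copy, not the stored value
    in_place = db['I']['go']

    sub = db['I']                  # read the subtree into a temporary variable
    sub['go'] += 1                 # modify it in memory
    db['I'] = sub                  # write it back under the same key
    reassigned = db['I']['go']

print(in_place, reassigned)        # prints 1 2
```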

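Is it OK to reuse this shelf object? Also, it doesn't work for multiple levels, only two...

This is certainly useful, but quite different; please remove the duplicate flag.

This can easily be done by subclassing the Shelve object's __getitem__ and adding a defaultdict object on KeyError. You could also use 4-long tuples.

That sounds interesting, but how would I do that?

That also looks like a good solution, but how do I use it to update a shelve object?

Does it need to be a shelve? Could you pickle the dictionary or dump it to JSON and save it to a file yourself?

Yes. I can't afford to rebuild this every time the code runs (I use it on very large datasets), nor to write and read a pickle file each time, so shelve (without writeback) is a very good solution. The point is how to make it work when updating multiple keys. I think it is possible with some temporary variables, but I still can't figure out exactly how.

OK, I've updated my answer. I hope that's enough to get you started.

Yes, you can initialize dict as dict=shelve.open('file',writeback=True) and it will work.

Yes, but the problem with writeback=True is that with a large dataset (which is the case here) we will run into memory problems, so I would like to avoid it.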
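
One of the comments above suggests using 4-long tuples instead of nesting. Shelf keys must be strings, so one possible reading of that suggestion is to join each 4-gram into a single string key, which makes every update a plain read-modify-write that works without writeback. This is only a sketch under assumptions: the tab separator SEP (which assumes no token contains a tab), the helper count_grams, and the temporary file path are all hypothetical and do not come from the thread.

```python
import os
import shelve
import tempfile

SEP = "\t"  # hypothetical separator; assumes no token contains a tab


def count_grams(db, grams):
    """Keep one flat counter per 4-gram in an open shelf."""
    for gram in grams:
        key = SEP.join(gram)          # shelf keys must be strings
        db[key] = db.get(key, 0) + 1  # explicit read, modify, write back


grams = [
    ['I', 'go', 'to', 'work'],
    ['I', 'go', 'there', 'often'],
    ['it', 'is', 'nice', 'being'],
    ['I', 'live', 'in', 'NY'],
    ['I', 'go', 'to', 'work'],
]

# Throwaway location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "four_grams")

with shelve.open(path) as db:         # writeback defaults to False
    count_grams(db, grams)

# Reopen the shelf to confirm the counts were stored on disk.
with shelve.open(path) as db:
    result = {key: db[key] for key in db}

print(result[SEP.join(['I', 'go', 'to', 'work'])])  # prints 2
```

The trade-off is that a flat key no longer lets you walk the tree by prefix (for example, listing everything that follows 'I', 'go') without scanning all keys.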