How do I iterate over a defaultdict(list) in Python?

python loops dictionary

How do I iterate over a defaultdict(list) in Python? Is there a better way to build a dictionary of lists in Python? I tried the usual iter(dict), but I got this error:

>>> import para
>>> para.print_doc('./sentseg_en/essentials.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "para.py", line 31, in print_doc
    for para in iter(doc):
TypeError: iteration over non-sequence

para.py:

# -*- coding: utf-8 -*-
## Modified paragraph into a defaultdict(list) structure
## Original code from http://code.activestate.com/recipes/66063/
from collections import defaultdict
class Paragraphs:
    # Separator here refers to the paragraph separator,
    #  the default separator is '\n'.
    def __init__(self, filename, separator=None):
        # Set separator if passed into object's parameter,
        #  else set default separator as '\n'
        if separator is None:
            def separator(line): return line == '\n'
        elif not callable(separator):
            raise TypeError("separator argument must be callable")
        self.separator = separator
        # Reading lines from files into a dictionary of lists
        self.doc = defaultdict(list)
        paraIndex = 0
        with open(filename) as readFile:
            for line in readFile:
                # separator is a callable, so call it instead of comparing to it
                if separator(line):
                    paraIndex+=1
                else:
                    self.doc[paraIndex].append(line)

# Prints out populated doc from txtfile
def print_doc(filename):
    text = Paragraphs(filename)
    for para in iter(text.doc):
        for sent in text.doc[para]:
            print("Para#%d, Sent#%d: %s" % (
                para, text.doc[para].index(sent), sent))

For example, /foo/bar/para-lines.txt looks like this:

This is a start of a paragraph.
foo barr
bar foo
foo foo
This is the end.

This is the start of next para.
foo boo bar bar
this is the end.

The output of the main class should look like this:

Para#1,Sent#1: This is a start of a paragraph.
Para#1,Sent#2: foo barr
Para#1,Sent#3: bar foo
Para#1,Sent#4: foo foo
Para#1,Sent#5: This is the end.

Para#2,Sent#1: This is the start of next para.
Para#2,Sent#2: foo boo bar bar
Para#2,Sent#3: this is the end.

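The problem seems to be that you are iterating over the Paragraphs class rather than over the dictionary. Also, instead of iterating over the keys and then looking up each dictionary entry, consider using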

for (key, value) in d.items():
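
As a minimal sketch of that suggestion (written for Python 3; the sample data and the helper name numbered_lines are illustrative assumptions, not part of the question), iterating a defaultdict(list) by key/value pairs could look like this:

```python
from collections import defaultdict

# Hypothetical sample data: paragraph index -> list of sentences.
doc = defaultdict(list)
doc[0].append("This is a start of a paragraph.")
doc[0].append("foo barr")
doc[1].append("This is the start of next para.")


def numbered_lines(d):
    """Return 'Para#i,Sent#j: text' strings, numbering both from 1."""
    lines = []
    # sorted() gives a predictable paragraph order regardless of insertion order.
    for key, sentences in sorted(d.items()):
        for n, sentence in enumerate(sentences, start=1):
            lines.append("Para#%d,Sent#%d: %s" % (key + 1, n, sentence))
    return lines


for line in numbered_lines(doc):
    print(line)
```

Using enumerate() also avoids calling list.index() for every sentence, which would report the wrong position when the same sentence occurs twice in a paragraph.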

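It fails because you have not defined __iter__() in the Paragraphs class, yet you then call iter(doc), where doc is a Paragraphs instance.

To be iterable, a class needs an __iter__() method that returns an iterator (or, failing that, a sequence-style __getitem__()).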
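
To illustrate the difference, here is a small Python 3 sketch. The class names WithoutIter and WithIter are hypothetical and exist only for this example:

```python
class WithoutIter:
    """Holds data but defines neither __iter__ nor __getitem__."""

    def __init__(self):
        self.items = ["a", "b"]


class WithIter:
    """Same data, but iterable because __iter__ returns an iterator."""

    def __init__(self):
        self.items = ["a", "b"]

    def __iter__(self):
        return iter(self.items)


try:
    iter(WithoutIter())
    failed = False
except TypeError:
    # iter() refuses objects that do not support iteration.
    failed = True

result = list(WithIter())
print(failed, result)
```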

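The problem you have on the line

for para in iter(doc):

is that doc is an instance of Paragraphs, not a defaultdict. The defaultdict you use in the __init__ method goes out of scope and is lost. So you need to do two things:

1. Save the doc created in the __init__ method as an instance variable (self.doc).
2. Either make Paragraphs itself iterable (by adding an __iter__ method), or give callers access to the doc object it created.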
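
The two steps above can be sketched as follows, in Python 3. This is an assumption-laden variant of the question's class rather than the asker's final code: to keep the example self-contained it takes any iterable of lines instead of a filename, and the choice to yield (index, lines) pairs from __iter__ is one option among several.

```python
from collections import defaultdict


class Paragraphs:
    """Group lines into paragraphs; by default a blank line separates them."""

    def __init__(self, lines, separator=None):
        if separator is None:
            def separator(line):
                return line == "\n"
        elif not callable(separator):
            raise TypeError("separator argument must be callable")
        self.separator = separator
        # Step 1: keep the dictionary on the instance so it outlives __init__.
        self.doc = defaultdict(list)
        para_index = 0
        for line in lines:
            if self.separator(line):
                para_index += 1
            else:
                self.doc[para_index].append(line)

    # Step 2: make the object itself iterable.
    def __iter__(self):
        # Yield (paragraph index, list of lines) pairs in index order.
        return iter(sorted(self.doc.items()))


sample = ["first\n", "second\n", "\n", "third\n"]
for index, sentences in Paragraphs(sample):
    print(index, sentences)
```
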
I can't think of any reason why you'd use a dict here, let alone a defaultdict. A list of lists would be much simpler.
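separator = '\n'  # a blank line ends a paragraph
doc = []
with open(filename) as readFile:
    para = []
    for line in readFile:
        if line == separator:
            # end of paragraph: store it and start a new one
            doc.append(para)
            para = []
        else:
            para.append(line)
    # store the final paragraph, which has no trailing separator
    doc.append(para)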
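
For a self-contained version of the same list-of-lists idea, here is a Python 3 sketch. The function name read_paragraphs, its is_separator parameter, and the in-memory sample text are assumptions made for illustration. Unlike the snippet above, it skips empty paragraphs and strips trailing newlines.

```python
import io


def read_paragraphs(lines, is_separator=lambda line: line == "\n"):
    """Return a list of paragraphs, each a list of lines without newlines."""
    doc = []
    para = []
    for line in lines:
        if is_separator(line):
            if para:  # ignore empty paragraphs caused by repeated blank lines
                doc.append(para)
            para = []
        else:
            para.append(line.rstrip("\n"))
    if para:  # the last paragraph has no trailing separator
        doc.append(para)
    return doc


text = io.StringIO("a\nb\n\n\nc\n")
doc = read_paragraphs(text)
print(doc)
```

Paragraph i, sentence j is then simply doc[i][j].
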

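The recipe you linked to is quite old. It was written in 2001, before Python had more modern tools such as itertools.groupby (introduced in Python 2.4). Here is how the code could look using groupby: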
import itertools
import sys

with open('para-lines.txt', 'r') as f:
    paranum = 0
    for is_separator, paragraph in itertools.groupby(f, lambda line: line == '\n'):
        if is_separator:
            # we've reached a paragraph separator: emit a blank line
            sys.stdout.write('\n')
        else:
            paranum += 1
            for n, sentence in enumerate(paragraph, start=1):
                sys.stdout.write(
                    'Para#{i:d},Sent#{n:d}: {s}'.format(
                        i=paranum, n=n, s=sentence))
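
If the paragraphs are needed as data rather than printed right away, the same groupby idea can be wrapped in a generator. The following Python 3 sketch assumes hypothetical helper names (iter_paragraphs, format_paragraphs) and uses an in-memory sample instead of a real file:

```python
import io
import itertools


def iter_paragraphs(lines):
    """Yield each paragraph as a list of lines; blank lines separate them."""
    groups = itertools.groupby(lines, key=lambda line: line == "\n")
    for is_separator, group in groups:
        if not is_separator:
            # Copy the group now: groupby invalidates it when it advances.
            yield list(group)


def format_paragraphs(lines):
    """Return 'Para#i,Sent#n: text' strings, numbering both from 1."""
    out = []
    for i, paragraph in enumerate(iter_paragraphs(lines), start=1):
        for n, sentence in enumerate(paragraph, start=1):
            out.append("Para#%d,Sent#%d: %s" % (i, n, sentence.rstrip("\n")))
    return out


sample = io.StringIO("one\ntwo\n\nthree\n")
formatted = format_paragraphs(sample)
print("\n".join(formatted))
```

Because iter_paragraphs is a generator, only one paragraph is held in memory at a time unless the caller chooses to collect them.
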

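Am I right in saying that paragraph goes out of scope when I exit the for loop? How do I keep paragraph and continue to access it outside the itertools.groupby loop?

No, the name paragraph does not go out of scope. Python does not open a new scope for block constructs such as with and for, only for functions, classes, and modules. paragraph is simply rebound to a new value on each pass through the loop. If you want to keep the earlier paragraphs, define a list paragraphs = [] outside the loop and append each paragraph to it inside the loop: paragraphs.append(list(paragraph)). The list() call matters, because each groupby group is a one-shot iterator.

I read in your comment to Daniel Roseman that the text file is large. Trying to hold all the paragraphs in a list (or dict) may use a lot of memory. Do you need all of them, or only the most recent n paragraphs (use a deque)? Do you need random access to them (use a dict) or to iterate over them (use a list or deque)? Knowing what the paragraphs are for will affect which algorithm and data structure we recommend.

I need to cross-reference other documents that have similar paragraphs, check their similarity, and align the sentences within the paragraphs. Technically, I have at least 150 text files, each with 3-10 paragraphs, each paragraph with 5-6 sentences, and each sentence at least 200-300 characters long. So my dictionary holds roughly 150*7*5*250 characters. I think that should be fine. But after testing the code I need to scale the system up to 1500 texts of 50 paragraphs each, and I think that will become a problem.

I tried saving doc as self.doc, in self.doc = defaultdict(list) and self.doc[paraIndex].append(line). But the same out-of-scope problem occurs.

@2er0: It is in scope, but as doc.doc (which means there is also a naming problem: in print_doc you should use a name like paragraphs rather than doc).

Yes, thanks for spotting the naming problem. After a few small changes the iteration works. But let me see whether I can combine the self.doc solution with unutbu's loop solution.

That's because my txt file will be one big txt file, so access through nested lists would take a lot of time. Maybe I need a dictionary. If I want a dictionary, what should I do?

How so? Why do you think a nested list would take longer than a dict of dicts?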
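
Several points from the comment thread can be sketched in a few lines of Python 3. The sample paragraphs and the maxlen value are made up for illustration:

```python
from collections import deque

paragraphs = []            # keeps every paragraph
recent = deque(maxlen=2)   # keeps only the 2 most recent paragraphs

for paragraph in (["a"], ["b"], ["c"]):
    paragraphs.append(paragraph)
    recent.append(paragraph)  # the oldest entry is dropped once the deque is full

# The loop variable is still bound after the loop ends: for does not
# create a new scope.
last_seen = paragraph

print(paragraphs, list(recent), last_seen)
```

Indexing a nested list by position, for example paragraphs[1][0], is a constant-time operation just like a dict lookup, so a dict offers no speed advantage when the keys are consecutive integers.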