Python TypeError: can only concatenate tuple (not "str") to tuple

I have a huge text file (1 GB) in which every "line" follows this syntax:

[number] [number]_[number]

For example:

123 123_1234
45 456_45    12 12_12

I get the following error:

  line 46, in open_delimited
    pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk, re.IGNORECASE)
TypeError: can only concatenate tuple (not "str") to tuple

with this code:

def open_delimited(filename, args):
    with open(filename, args, encoding="UTF-16") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk, re.IGNORECASE)
            for piece in pieces[:-1]:
                yield piece
            remainder = pieces[-1]
        if remainder:
            yield remainder

filename = 'data/AllData_2000001_3000000.txt'
for chunk in open_delimited(filename, 'r'): 
    print(chunk)

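re.findall() returns a sequence of tuples when the pattern contains multiple capturing groups. Your pattern has two such groups, so each piece is a (number, number_number) pair: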

>>> re.findall(r"(\d+)\s+(\d+_\d+)", '45 456_45    12 12_12')
[('45', '456_45'), ('12', '12_12')]

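Note that since you are only matching whitespace and digits, the re.IGNORECASE flag is entirely redundant.

You assign the last such piece to remainder, then loop back and prepend it to chunk, which does not work: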

>>> ('12', '12_12') + '123 123_1234\n'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate tuple (not "str") to tuple
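
To make the type difference concrete, here is a small sketch comparing the same pattern with and without capturing groups. The variable names (pairs, records, next_chunk, combined) are made up for illustration. It only shows why a str can be prepended to the next chunk while a tuple cannot; it does not deal with text that follows the last match.

```python
import re

text = '45 456_45    12 12_12'

# Two capturing groups: every result is a tuple holding both groups.
pairs = re.findall(r"(\d+)\s+(\d+_\d+)", text)

# No capturing groups: every result is the whole match as a plain str.
records = re.findall(r"\d+\s+\d+_\d+", text)

# A str can be concatenated with the next chunk; a tuple raises TypeError.
next_chunk = '\n123 123_1234\n'  # hypothetical follow-up chunk
combined = records[-1] + next_chunk

print(pairs)
print(records)
print(repr(combined))
```

If the separate groups are needed later, the whole-match string can be split again once a record is known to be complete.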

As already mentioned, there is a bug in the processing: findall() gives a list of tuples.

An alternative approach could be:

import re

def open_delimited(filename, args):
    with open(filename, args, encoding="UTF-16") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            remainder += chunk  # add it to the to-be-processed string
            pieces = list(re.finditer(r"(\d+)\s+(\d+_\d+)", remainder, re.IGNORECASE))
            if not pieces:
                continue  # no complete record yet; keep accumulating
            # Those pieces are match objects.
            for piece in pieces[:-1]:  # omit the last one, as before
                yield piece.group()  # the whole match
            # The last match tells us where to start again; slice from there.
            remainder = remainder[pieces[-1].start():]
        if remainder:
            yield remainder  # whatever is left after the final chunk

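Here, the pieces are not tuples of strings but match objects. They tell us not only what they contain but also where they come from, which makes it easy to "recreate" the remainder.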
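
As a rough illustration of what a match object offers, the sketch below inspects the last match in the sample line from the question. The names (matches, last, whole, groups, offset, tail, first_char) are made up for this example.

```python
import re

text = '45 456_45    12 12_12'
matches = list(re.finditer(r"(\d+)\s+(\d+_\d+)", text))
last = matches[-1]

whole = last.group()    # the full matched text
groups = last.groups()  # the captured groups as a tuple
offset = last.start()   # index in text where the match begins

# Slicing from the offset recovers the last record plus anything after it.
tail = text[offset:]
# Indexing without the colon would return just one character.
first_char = text[offset]

print(whole, groups, offset, repr(tail), repr(first_char))
```

The slice is what lets the remainder carry over both the held-back record and any partial text that follows it.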

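remainder is a tuple on the second iteration of the loop, not a string. Your code will also fail if chunk is a partial record, because in that case there are no matches at all. It is better to append the chunk to remainder first and then split the combined string where appropriate.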
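
One possible way to follow that advice is sketched below. It is not the original poster's code: the function name iter_records and the in-memory list of chunks are assumptions made for illustration, and it assumes records are separated only by whitespace and does not validate the token format. In the real program the chunks would come from infile.read(chunksize).

```python
def iter_records(chunks):
    """Yield (number, number_number) pairs from an iterable of text chunks."""
    remainder = ''  # possibly incomplete token from the end of the last chunk
    pending = []    # complete tokens that have not been paired up yet
    for chunk in chunks:
        buffer = remainder + chunk
        tokens = buffer.split()
        # If the buffer does not end in whitespace, the last token may be
        # cut off by the chunk boundary, so hold it back for the next round.
        if tokens and not buffer[-1].isspace():
            remainder = tokens.pop()
        else:
            remainder = ''
        pending.extend(tokens)
        while len(pending) >= 2:
            yield pending.pop(0), pending.pop(0)
    # After the last chunk, the held-back token is complete.
    if remainder:
        pending.append(remainder)
    if len(pending) == 2:
        yield pending[0], pending[1]


# Chunk boundaries deliberately fall in the middle of records.
sample_chunks = ['123 123_12', '34\n45 45', '6_45    12 12_12\n']
result = list(iter_records(sample_chunks))
print(result)
```

Holding back only the trailing token keeps the buffer small no matter how large the file is, which matters for a 1 GB input.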