Parsing a large (9GB) file using Python

python python-3.x

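I have a large text file that I need to parse into a pipe-delimited text file using Python. The file is (basically) a series of records, each separated from the next by two newline characters ("\n\n"). I wrote the parser below: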

with open ("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n",fullstr)

articles = []

for i,s in enumerate(allsplits[0:]):

        splits = re.split("\n.*?: ",s)
        productId = splits[0]
        userId = splits[1]
        profileName = splits[2]
        helpfulness = splits[3]
        rating = splits[4]
        time = splits[5]
        summary = splits[6]
        text = splits[7]

fw = open(outnamename,'w')
fw.write(productId+"|"+userID+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")

return

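The problem is that the file I am reading is so large that I run out of memory before the script finishes. I suspect it fails on this line:

allsplits = re.split("\n\n", fullstr)

Can someone show me how to read one record at a time, parse it, write it to the output file, and then move on to the next record?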

Use "readline()" to read the fields of a record one at a time. Alternatively, you can use read(n) to read "n" bytes.
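As a rough sketch of the readline() suggestion above: the helper below, read_record (a hypothetical name, not from the original answer), pulls one blank-line-delimited record at a time from an open file. The sample values are made up for illustration, and io.StringIO stands in for the real 9GB file.

```python
import io


def read_record(fh):
    """Read lines with readline() until a blank line or EOF; return the record's lines."""
    lines = []
    while True:
        line = fh.readline()
        if not line:            # empty string means end of file
            break
        if not line.strip():    # a blank line ends the current record
            if lines:
                break
            continue            # skip extra blank lines before a record starts
        lines.append(line.rstrip("\n"))
    return lines


# Hypothetical two-record sample; only one record is held in memory at a time.
sample = io.StringIO(
    "product/productId: A1\nreview/userId: U1\n\n"
    "product/productId: B2\nreview/userId: U2\n"
)
first = read_record(sample)
second = read_record(sample)
third = read_record(sample)     # an empty list signals that the file is exhausted
```

Calling read_record in a loop until it returns an empty list processes the file record by record without ever loading it whole.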

Don't read the whole file into memory in one go; iterate over the lines instead, and use Python's csv module to write out the records:

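import csv

with open('hugeinputfile.txt', 'r') as infile, open('outputfile.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='|')

    # csv.reader always ends a row at a newline and ignores lineterminator,
    # so group the lines of each blank-line-separated record by hand.
    record = []
    for line in infile:
        if line.strip():
            record.append(line.split(':', 1)[-1].strip())  # text after the first colon
        elif record:
            writer.writerow(record)
            record = []
    if record:
        writer.writerow(record)  # last record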

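A few things to note here:

Use "with" to open the files. Why? Because using "with" ensures that the file is close()d, even if an exception interrupts the script.

So:

with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to: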

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()
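To check the claim above in practice, here is a small sketch that raises an exception inside a "with" block and then inspects the file object. The temporary file and the simulated error are assumptions made purely for the demonstration.

```python
import os
import tempfile

# Create a small throwaway file to open (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\n")
    path = tmp.name

try:
    with open(path) as f:
        # Simulate a failure in the middle of processing the file.
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

# The context manager closed the file even though the block raised.
closed_after_error = f.closed

os.remove(path)  # clean up the throwaway file
```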

To be continued... (I'm out of time at the moment)

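Don't read the whole file into memory in one go; produce records by making use of those newlines. Write the data with the csv module, which makes it easy to write out pipe-delimited records.

The following code reads the input file one line at a time and writes out a CSV row for each record as it goes. It never holds more than one line in memory, plus the one record being constructed.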

import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

# outnamename is the output path variable from the question; define it before running
with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip():
            # empty line is the end of a record; repeated blank lines are skipped
            if record:
                writer.writerow(record)
                record = {}
            continue

        # "category/key: value" -> store value under key
        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)

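This code assumes that the text before the colon on each line has the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV columns; the fields list at the top reflects these keys.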
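The key extraction described above can be tried on a single line in isolation. This is only an illustrative sketch: parse_line is a hypothetical helper name, and the second sample line is invented to show that a colon inside the value survives because the split happens only once.

```python
def parse_line(line):
    """Split a 'category/key: value' line into (key, value)."""
    # Split on the first ': ' only, so later colons stay in the value.
    field, value = line.split(": ", 1)
    # partition('/') returns (before, '/', after); [-1] keeps the part after the slash.
    return field.partition("/")[-1].strip(), value.strip()


key, value = parse_line("product/productId: D7SDF9S9\n")
summary_key, summary_value = parse_line("review/summary: Great: would buy again\n")
```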

Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list:

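import csv

# outnamename is the output path variable from the question; define it before running
# Python 3: open the output in text mode with newline='' rather than 'wb'
with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip():
            # empty line is the end of a record; repeated blank lines are skipped
            if record:
                writer.writerow(record)
                record = []
            continue

        # keep only the value; its position in the list determines the column
        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)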

This version requires that all record fields are present and that they appear in the file in a fixed order.
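To illustrate why the list-based version depends on complete, consistently ordered records, the sketch below writes the same incomplete record with both writers. The shortened field list and the values are assumptions for the demonstration, and io.StringIO stands in for the output file.

```python
import csv
import io

fields = ("productId", "userId", "rating")  # shortened field list for illustration

# DictWriter: keys arrive out of order and 'userId' is missing entirely.
out = io.StringIO()
writer = csv.DictWriter(out, fields, delimiter="|", lineterminator="\n")
writer.writerow({"rating": "5.0", "productId": "D7SDF9S9"})
dict_row = out.getvalue()   # columns follow `fields`; the missing key becomes an empty cell

# Plain writer: the same values collected in a list, in arrival order.
out = io.StringIO()
writer = csv.writer(out, delimiter="|", lineterminator="\n")
writer.writerow(["5.0", "D7SDF9S9"])
list_row = out.getvalue()   # columns are shifted and nothing marks the gap
```

With DictWriter the row keeps its column alignment, while the plain writer silently produces a shorter, misordered row.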

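This looks like it was made for sed.

Is there always a colon before the data? Your code makes me think so, but your last entry doesn't have one. Can the last entry (text) contain multiple lines?

This won't split out the record keys; you are writing product/productId:D7SDF9S9 instead of D7SDF9S9.

@MartijnPieters: Ah, you're right! I overlooked that part.

Hey, thanks! This looks good. When I use this approach I get this error: "csv.writerow(record); AttributeError: 'module' object has no attribute 'writerow'". Do you know what my problem is?

@user2896837: I made a silly mistake; corrected, it is writer.writerow().

Now I get: "File "//Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 153, in writerow return self.writer.writerow(self._dict_to_list(rowdict)) TypeError: 'str' does not support the buffer interface"

Ah, this is Python 3; I adjusted how the output file is opened for you.

Sorry for all the questions; I hate to be a bother. After changing how the output file is opened, I am getting this error: "File "//Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors)[0] UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 4146: ordinal not in range(128)". I have also tried encoding the strings with ".encode('utf-8')", but no luck yet. Thanks again for your help and patience.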
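The thread does not show a resolution for the UnicodeDecodeError, so the following is only a hedged sketch of one common remedy: passing an explicit encoding to open() for the input file instead of relying on the platform default. It assumes the input is Latin-1 encoded (where byte 0xf8 is "ø"); the real file's encoding is unknown, and the sample line is invented.

```python
import os
import tempfile

# Write a small Latin-1 encoded sample containing byte 0xf8 (hypothetical data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("review/profileName: J\xf8rgen\n".encode("latin-1"))
    path = tmp.name

# Assumption: the input is Latin-1; swap in the real encoding if it is known.
with open(path, "r", encoding="latin-1") as fh:
    decoded = fh.readline().strip()

# If the encoding is unknown, errors="replace" avoids the exception, at the cost
# of substituting U+FFFD for any bytes that cannot be decoded.
with open(path, "r", encoding="utf-8", errors="replace") as fh:
    lossy = fh.readline().strip()

os.remove(path)  # clean up the throwaway file
```

The same encoding argument can be given when opening the output file, so that non-ASCII characters are written consistently.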
