Batching a very large text file in Python


I am trying to batch a very large text file (approximately 150 GB) into several smaller text files (approximately 10 GB each).

My general process will be:

# iterate over file one line at a time
# accumulate batch as string 
--> # given a certain count that correlates to the size of my current accumulated batch and when that size is met: (this is where I am unsure)
        # write to file

# accumulate size count

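I have a rough metric for deciding when to start a new batch (when the desired batch size is reached), but I am less clear on how often to write to disk within a given batch. For example, if my batch size is 10 GB, I assume I will need to write iteratively rather than hold the entire 10 GB batch in memory. Obviously I do not want to write more often than I have to, as that could be quite expensive.

Do you have any rough calculations or tricks for working out when to write to disk for a task like this, e.g. size vs. memory?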

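Here is an example of writing line by line. The files are opened in binary mode to avoid the line-decoding step, which takes a modest amount of time and would also skew the size count if you counted characters. For example, UTF-8 encoding may use multiple bytes on disk for a single Python character.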
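As a small illustration of why counting decoded characters would misjudge the size on disk, the sketch below compares the two counts for one line. The sample string is an arbitrary choice made up for this example.

```python
# Illustrative only: the sample text is arbitrary and not from the original post.
text = "na\u00efve caf\u00e9\n"   # contains two non-ASCII characters

encoded = text.encode("utf-8")

chars = len(text)       # number of Python characters (code points)
nbytes = len(encoded)   # number of bytes this line occupies on disk

print(chars, nbytes)
```

In binary mode `len(line)` is already the byte count, so the running total matches what ends up on disk.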

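The 4 meg buffer size is a guess. The idea is to get the operating system to read more of the file at once and so reduce seek times. Whether this works, and what the best number is, is debatable and will differ between operating systems. I found that 4 meg made a difference... but that was years ago and things change.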

outfile_template = "outfile-{}.txt"
infile_name = "infile.txt"
chunksize = 10_000_000_000
MEB = 2**20   # mebibyte

count = 0
byteswritten = 0
infile = open(infile_name, "rb", buffering=4*MEB)
outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)

try:
    for line in infile:
        if byteswritten > chunksize:
            outfile.close()
            byteswritten = 0
            count += 1
            outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)
        outfile.write(line)
        byteswritten += len(line)
finally:
    infile.close()
    outfile.close()
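The same line-by-line approach can be wrapped in a function so it is easy to try on a small file first. This is only a sketch: the function name `split_by_lines`, its parameters, and the demo file are assumptions made for illustration. Unlike the code above, it opens each output file only when there is a line to put in it, so an empty input produces no output files.

```python
import os
import tempfile
from pathlib import Path


def split_by_lines(infile_name, outfile_template, chunksize, buffering=4 * 2**20):
    """Copy infile line by line into numbered files of roughly chunksize bytes."""
    written_files = []
    byteswritten = 0
    outfile = None
    with open(infile_name, "rb", buffering=buffering) as infile:
        try:
            for line in infile:
                # Open the first file lazily, and roll over once the current one is full.
                if outfile is None or byteswritten >= chunksize:
                    if outfile is not None:
                        outfile.close()
                    name = outfile_template.format(len(written_files))
                    outfile = open(name, "wb", buffering=buffering)
                    written_files.append(name)
                    byteswritten = 0
                outfile.write(line)
                byteswritten += len(line)   # bytes, because the file is in binary mode
        finally:
            if outfile is not None:
                outfile.close()
    return written_files


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "infile.txt")
    with open(src, "wb") as f:
        f.writelines(b"line %d\n" % i for i in range(1000))
    parts = split_by_lines(src, os.path.join(tmp, "outfile-{}.txt"), chunksize=2000)
    joined = b"".join(Path(p).read_bytes() for p in parts)
    roundtrip_ok = joined == Path(src).read_bytes()
    print(len(parts), roundtrip_ok)
```

Memory use stays at roughly one line plus the two buffers, regardless of `chunksize`, because nothing is accumulated in Python between writes.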

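I used a slightly modified version of this to parse 250 GB of JSON. I choose how many smaller files I need (number_of_slices), then find the positions at which to slice the file (I always look for a line ending). Finally, I slice the file with file.seek and file.read(chunk).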

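import os
import mmap

FULL_PATH_TO_FILE = 'full_path_to_a_big_file'
OUTPUT_PATH = 'full_path_to_a_output_dir'  # where the sliced files will be generated


def next_newline_finder(mmapf):
    # Advance byte by byte from the current position and
    # return the offset just past the next newline.
    while True:
        current = mmapf.read_byte()
        if current == ord('\n'):  # or whatever line-end symbol
            return mmapf.tell()


# find the positions at which to slice the file
file_info = os.stat(FULL_PATH_TO_FILE)
file_size = file_info.st_size
positions_for_file_slice = [0]
number_of_slices = 15  # say you want to slice the big file into 15 smaller files
size_per_slice = file_size // number_of_slices

with open(FULL_PATH_TO_FILE, "rb") as f:
    mmapf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    slice_counter = 1
    while slice_counter < number_of_slices:
        pos = size_per_slice * slice_counter
        mmapf.seek(pos)
        newline_pos = next_newline_finder(mmapf)
        positions_for_file_slice.append(newline_pos)
        slice_counter += 1
    mmapf.close()

# create (from, to) ranges for the positions found
positions_for_file_slice = [
    (pos, positions_for_file_slice[i + 1]) if i < (len(positions_for_file_slice) - 1)
    else (positions_for_file_slice[i], file_size)
    for i, pos in enumerate(positions_for_file_slice)
]

# do the actual slicing of the file
with open(FULL_PATH_TO_FILE, "rb") as f:
    for i, position_pair in enumerate(positions_for_file_slice):
        read_from, read_to = position_pair
        f.seek(read_from)
        # note: this reads one whole slice into memory at a time
        chunk = f.read(read_to - read_from)
        with open(os.path.join(OUTPUT_PATH, f'dummyfile{i}.json'), 'wb') as chunk_file:
            chunk_file.write(chunk)
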
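A variation on the slicing idea above, as a hedged sketch: `mmap.find` can locate the next newline without a Python-level byte loop, and each slice can be copied in bounded pieces so that a whole slice never has to sit in memory. The function names `find_slice_positions` and `write_slices`, the `bitesize` parameter, and the demo data are assumptions for illustration. The sketch also assumes the input file is not empty, since an empty file cannot be memory-mapped.

```python
import mmap
import os
import tempfile
from pathlib import Path


def find_slice_positions(path, number_of_slices):
    """Return (start, end) byte ranges; every range but the last ends just after a newline."""
    file_size = os.path.getsize(path)
    size_per_slice = file_size // number_of_slices
    cuts = [0]
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for n in range(1, number_of_slices):
            # Search from the nominal cut point, but never before the previous cut.
            nl = mm.find(b"\n", max(size_per_slice * n, cuts[-1]))
            if nl == -1:      # no newline left: the rest goes into the last slice
                break
            cuts.append(nl + 1)
    cuts.append(file_size)
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


def write_slices(path, ranges, output_dir, bitesize=4 * 2**20):
    """Copy each byte range to its own file, at most bitesize bytes at a time."""
    names = []
    with open(path, "rb") as src:
        for i, (start, end) in enumerate(ranges):
            name = os.path.join(output_dir, f"dummyfile{i}.json")
            src.seek(start)
            remaining = end - start
            with open(name, "wb") as dst:
                while remaining > 0:
                    bite = src.read(min(bitesize, remaining))
                    if not bite:
                        break
                    dst.write(bite)
                    remaining -= len(bite)
            names.append(name)
    return names


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.json")
    Path(big).write_bytes(b"".join(b'{"id": %d}\n' % i for i in range(1000)))
    ranges = find_slice_positions(big, number_of_slices=15)
    parts = write_slices(big, ranges, tmp, bitesize=256)
    pieces = [Path(p).read_bytes() for p in parts]
    slices_ok = b"".join(pieces) == Path(big).read_bytes()
    ends_ok = all(piece.endswith(b"\n") for piece in pieces)
    print(len(parts), slices_ok, ends_ok)
```

Cutting only at newlines keeps each line intact, which is what matters for line-delimited JSON; a single JSON document spread over many lines would still need a structure-aware splitter.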
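Assuming your large file is simple unstructured text (i.e. this is no good for structured text like JSON), here is an alternative to reading every single line: read large binary bites of the input file until you reach your chunksize, then read a couple of lines, close the current output file and move on to the next.

I compared this with the line-by-line approach, using @tdelaney's code adapted to the same chunksize as mine. That code took 250 s to split a 12 GiB input file into 6 x 2 GiB chunks, whereas this took about 50 s, so perhaps five times faster. It looks I/O bound on my SSD, running at more than 200 MB/s read and write, where the line-by-line version was running at 40-50 MB/s read and write.
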
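The throughput figures follow directly from the quoted timings. The short calculation below is only arithmetic on the numbers given in this answer (12 GiB, 250 s, about 50 s); the variable names are made up for illustration.

```python
GIB = 2**30
MIB = 2**20

input_bytes = 12 * GIB   # size of the input file quoted in the answer

line_by_line_rate = input_bytes / 250 / MIB   # 250 s for the line-by-line run
big_bite_rate = input_bytes / 50 / MIB        # roughly 50 s for the big-read run

print(f"line by line: {line_by_line_rate:.0f} MiB/s")
print(f"big bites:    {big_bite_rate:.0f} MiB/s")
print(f"speed-up:     {250 / 50:.0f}x")
```

That works out to about 49 MiB/s against about 246 MiB/s, consistent with the observed rates. Every byte is both read and written, so each figure applies to reads and to writes separately.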
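I turned buffering off because there is not much point in it here. The bite size and the buffering setting could probably be tuned to improve performance, but I have not tried any other settings because it seems to be I/O bound anyway.
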
import time

outfile_template = "outfile-{}.txt"
infile_name = "large.text"
chunksize = 2_000_000_000
MEB = 2**20   # mebibyte
bitesize = 4_000_000 # the size of the reads (and writes) working up to chunksize

count = 0

starttime = time.perf_counter()

infile = open(infile_name, "rb", buffering=0)
outfile = open(outfile_template.format(count), "wb", buffering=0)

while True:
    byteswritten = 0
    while byteswritten < chunksize:
        bite = infile.read(bitesize)
        # check for EOF
        if not bite:
            break
        outfile.write(bite)
        byteswritten += len(bite)
    # check for EOF
    if not bite:
        break
    for i in range(2):
        l = infile.readline()
        # check for EOF
        if not l:
            break
        outfile.write(l)
    # check for EOF
    if not l:
        break
    outfile.close()
    count += 1
    print( count )
    outfile = open(outfile_template.format(count), "wb", buffering=0)

outfile.close()
infile.close()

endtime = time.perf_counter()

elapsed = endtime-starttime

print( f"Elapsed= {elapsed}" )

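Note that I have not exhaustively tested that this does not lose data. There is no evidence that it loses anything, but you should validate that yourself.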
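One way to do that validation, sketched under assumptions: hash the input file and the concatenation of the output files and compare the digests. The helper name `sha256_of_files` and the tiny demo files are hypothetical. The output files must be passed in the order they were written.

```python
import hashlib
import os
import tempfile


def sha256_of_files(paths, bitesize=4 * 2**20):
    """Hash the concatenation of the given files without loading them whole."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as f:
            while True:
                bite = f.read(bitesize)
                if not bite:
                    break
                digest.update(bite)
    return digest.hexdigest()


# Small demonstration with throwaway files standing in for the real ones.
with tempfile.TemporaryDirectory() as tmp:
    whole = os.path.join(tmp, "large.text")
    part_names = [os.path.join(tmp, f"outfile-{i}.txt") for i in range(2)]
    with open(whole, "wb") as f:
        f.write(b"first line\nsecond line\n")
    with open(part_names[0], "wb") as f:
        f.write(b"first line\n")
    with open(part_names[1], "wb") as f:
        f.write(b"second line\n")
    hashes_match = sha256_of_files([whole]) == sha256_of_files(part_names)
    print(hashes_match)
```

Matching digests show that no bytes were lost, duplicated or reordered. They do not check that every chunk ends on a line boundary, which would need a separate test.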
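Some robustness could be added by checking, at the end of each chunk, how much data is left to read, so that you do not end up with a last output file that is 0 bytes long (or shorter than bitesize).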
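A minimal sketch of that robustness check follows. It makes assumptions beyond the original code: the function name `split_by_bites` is made up, default buffering is used instead of `buffering=0`, only one `readline()` is used to finish a line that was cut in half, and a tail smaller than `bitesize` is folded into the current file.

```python
import os
import tempfile
from pathlib import Path


def split_by_bites(infile_name, outfile_template, chunksize, bitesize):
    """Split infile into files of about chunksize bytes, each ending on a line boundary."""
    total = os.path.getsize(infile_name)
    names = []
    with open(infile_name, "rb") as infile:
        while infile.tell() < total:      # open a new output file only if data remains
            name = outfile_template.format(len(names))
            names.append(name)
            with open(name, "wb") as outfile:
                byteswritten = 0
                while byteswritten < chunksize:
                    bite = infile.read(bitesize)
                    if not bite:          # EOF
                        break
                    outfile.write(bite)
                    byteswritten += len(bite)
                # Finish the line that the last bite may have cut in half.
                outfile.write(infile.readline())
                # Fold a small tail into this file instead of creating a tiny last file.
                if total - infile.tell() < bitesize:
                    outfile.write(infile.read())
    return names


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "large.text")
    Path(src).write_bytes(b"".join(b"line %d\n" % i for i in range(1000)))
    parts = split_by_bites(src, os.path.join(tmp, "outfile-{}.txt"),
                           chunksize=2000, bitesize=500)
    pieces = [Path(p).read_bytes() for p in parts]
    bites_ok = b"".join(pieces) == Path(src).read_bytes()
    no_empty = all(len(piece) > 0 for piece in pieces)
    lines_ok = all(piece.endswith(b"\n") for piece in pieces)
    print(len(parts), bites_ok, no_empty, lines_ok)
```

Because the loop condition compares `infile.tell()` with the file size, an input that ends exactly on a chunk boundary does not leave an empty trailing file.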
HTH
Barny
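Why bother buffering? Can't you just write one line at a time?
Isn't writing every line very expensive?
It will let the OS figure it out. You could open in binary mode with a larger buffering parameter (say 4 meg), then read and write line by line. You will get good performance.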