
Splitting a large text file into smaller text files by line number using Python


I have a text file named really_big_file.txt that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000
I would like to write a Python script that splits really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would contain lines 1-300, small_file_600 would contain lines 301-600, and so on, until enough small files have been created to contain all of the lines in the big file.

Any suggestions on the easiest way to accomplish this with Python would be greatly appreciated.

Unlike approaches that store every line in a list, the advantage of this approach is that it operates on the iterables lazily, line by line, so it never has to hold an entire small_file in memory at once.
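
The grouper used here is presumably the standard itertools recipe; below is a minimal sketch of the approach being described, assuming Python 3 (zip_longest; the comments further down note that Python 2 used izip_longest) and a chunk size of n = 300, so the output names follow the small_file_300.txt, small_file_600.txt pattern from the question:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    # Collect the iterable into fixed-length chunks of n items;
    # the final chunk is padded with fillvalue if the sizes don't divide evenly.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

n = 300

with open('really_big_file.txt') as f:
    for i, group in enumerate(grouper(n, f, fillvalue=''), 1):
        # group is a tuple of up to n lines; padding entries are empty strings.
        with open('small_file_{0}.txt'.format(i * n), 'w') as fout:
            fout.writelines(group)

With fillvalue='' the padding entries write nothing, and because each file name is derived from i * n, the last name can overshoot the real line count, which is exactly the quirk the note below describes.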

Note that in this case the last file will be small_file_100200.txt but will only go up to line 100000. This happens because fillvalue='', meaning I write nothing out to the file when I have no more lines left to write, since the group size doesn't divide evenly. You can fix this by writing to a temp file and then renaming it afterwards, instead of naming it first like I have. Here's how that can be done:

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):  # grouper() and n as defined above
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None, and I check every line for None; when it occurs, I know the process has ended, so I subtract 1 from j so as not to count the filler, and then I write the file.

I do this in a more understandable way, using fewer shortcuts, to give you a better understanding of how and why this works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (each line keeps its own '\n')
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created

    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.write(''.join(lines))
        created_files += 1

print('%s small files (with %s lines each) were created.'
      % (created_files, lines_per_file))
Since you didn't post any code, I decided to do it this way, because you may not be familiar with anything beyond basic Python syntax, given that the way you worded the question made it seem as though you hadn't tried anything and had no clue how to approach the problem.

Here are the steps to do it in basic Python:

First, you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
Second, you need to set up a way of creating new files by name! I'd suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
Third, inside that loop you need some nested loops that will save the correct rows into an array:

    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
Note: if your line count isn't divisible by 300, the name of the last file won't correspond to its last line number (for example, with 1,000 input lines the last file is named small_file_1200.txt but only holds lines 901-1000).

It's important to understand why these loops work. You have it set up so that on the next iteration, the name of the file you write to changes, because the name depends on a changing variable. This is a very useful scripting tool for file access, opening, writing, organizing, and so on.

In case you couldn't follow what was in which loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

I had to do the same thing with a 650,000-line file.

Use the enumerate index and integer division (//) with the chunk size.

When that number changes, close the current file and open a new one.

This is a Python 3 solution using format strings:

chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')  # the folder './a_folder' must already exist

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback; comment this out to speed the process up

        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # The chunk number changed: close the current file, open the next one,
            # and write this line to the new file.
            this_small_file.close()
            this_small_file = open(file_name, 'a')
            this_small_file.write(line)

this_small_file.close()

files is set to the number of files you want to split the main file into; in my example, I want to get 10 files from my main file.

files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
    batchs = len(emails) // files  # lines per output file
    for id, log in enumerate(emails):
        fileid = id // batchs
        with open("minifile{file}.txt".format(file=fileid + 1), 'a+') as small_file:
            small_file.write(log)

Comments on the answers above:

The only problem is that with this method you have to hold each small_file in memory before writing it out, which may or may not be a problem. Of course you could fix that by changing it to write to the file line by line.

Hey, quick question: would you mind explaining why you use quotechar='\"'? Thanks.

Because my example has a different quote character (|). You can skip setting it, since the default quotechar is the double quote (").

For anyone who cares about speed: a CSV file containing 98,500 records (around 13 MB) was split in about 2.31 seconds with this code. I'd say that's pretty good.

Very nice @Ryan Saxe!

If you're using the first script in Python 3.x, replace izip_longest with the new zip_longest.

@YuvalPruss I updated it per your comment, now that Py3 is standard.

You can speed it up by commenting out print(i, file_name).

# Another approach: stream through the big file, starting a new output file
# every lines_per_file lines.
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

# CSV variant: split large_file.csv into files of MAX_CHUNKS rows each.
import csv
import os
import re

MAX_CHUNKS = 300  # rows per output file


def writeRow(idr, row):
    # Append one row to the chunk file with index idr (Python 3: text mode, newline='').
    with open("file_%d.csv" % idr, 'a', newline='') as f_out:
        writer = csv.writer(f_out, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # Remove chunk files left over from a previous run.
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        idr = 1
        for i, x in enumerate(r):
            if i and i % MAX_CHUNKS == 0:
                idr += 1  # start a new chunk file every MAX_CHUNKS rows
            writeRow(idr, x)

if __name__ == "__main__":
    main()