
Splitting a large text file into smaller text files by line number using Python


I have a text file named really_big_file.txt that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000
I would like to write a Python script that splits really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would contain lines 1-300, small_file_600 would contain lines 301-600, and so on, until enough small files have been created to contain all of the lines in the big file.

Any suggestions on the easiest way to accomplish this with Python would be greatly appreciated.

Unlike approaches that store every line in a list, the advantage of this approach is that it operates on the iterables lazily, line by line, so it never has to hold an entire small_file in memory at once.
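
The grouper used here is presumably the standard itertools recipe; below is a minimal sketch of the approach being described, assuming Python 3 (zip_longest; the comments further down note that Python 2 used izip_longest) and a chunk size of n = 300, so the output names follow the small_file_300.txt, small_file_600.txt pattern from the question:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    # Collect the iterable into fixed-length chunks of n items;
    # the final chunk is padded with fillvalue if the sizes don't divide evenly.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

n = 300

with open('really_big_file.txt') as f:
    for i, group in enumerate(grouper(n, f, fillvalue=''), 1):
        # group is a tuple of up to n lines; padding entries are empty strings.
        with open('small_file_{0}.txt'.format(i * n), 'w') as fout:
            fout.writelines(group)

With fillvalue='' the padding entries write nothing, and because each file name is derived from i * n, the last name can overshoot the real line count, which is exactly the quirk the note below describes.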

Note that in this case the last file will be small_file_100200.txt but will only go up to line 100000. This happens because fillvalue='', meaning I write nothing out to the file when I have no more lines left to write, since the group size doesn't divide evenly. You can fix this by writing to a temp file and then renaming it afterwards, instead of naming it first like I have. Here's how that can be done:

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):  # grouper() and n as defined above
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None, and I check every line for None; when it occurs, I know the process has ended, so I subtract 1 from j so as not to count the filler, and then I write the file.

I do this in a more understandable way, using fewer shortcuts, to give you a better understanding of how and why this works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (each line keeps its own '\n')
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created

    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.write(''.join(lines))
        created_files += 1

print('%s small files (with %s lines each) were created.'
      % (created_files, lines_per_file))
Since you didn't post any code, I decided to do it this way, because you may not be familiar with anything beyond basic Python syntax, given that the way you worded the question made it seem as though you hadn't tried anything and had no clue how to approach the problem.

Here are the steps to do it in basic Python:

First, you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
Second, you need to set up a way of creating new files by name! I'd suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
Third, inside that loop you need some nested loops that will save the correct rows into an array:

    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
Note: if your line count isn't divisible by 300, the name of the last file won't correspond to its last line number (for example, with 1,000 input lines the last file is named small_file_1200.txt but only holds lines 901-1000).

It's important to understand why these loops work. You have it set up so that on the next iteration, the name of the file you write to changes, because the name depends on a changing variable. This is a very useful scripting tool for file access, opening, writing, organizing, and so on.

In case you couldn't follow what was in which loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

I had to do the same thing with a 650,000-line file.

Use the enumerate index and integer division (//) with the chunk size.

When that number changes, close the current file and open a new one.

This is a Python 3 solution using format strings:

chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')  # the folder './a_folder' must already exist

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback; comment this out to speed the process up

        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # The chunk number changed: close the current file, open the next one,
            # and write this line to the new file.
            this_small_file.close()
            this_small_file = open(file_name, 'a')
            this_small_file.write(line)

this_small_file.close()

files is set to the number of files you want to split the main file into; in my example, I want to get 10 files from my main file.

files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
    batchs = len(emails) // files  # lines per output file
    for id, log in enumerate(emails):
        fileid = id // batchs
        with open("minifile{file}.txt".format(file=fileid + 1), 'a+') as small_file:
            small_file.write(log)

Comments on the answers above:

The only problem is that with this method you have to hold each small_file in memory before writing it out, which may or may not be a problem. Of course you could fix that by changing it to write to the file line by line.

Hey, quick question: would you mind explaining why you use quotechar='\"'? Thanks.

Because my example has a different quote character (|). You can skip setting it, since the default quotechar is the double quote (").

For anyone who cares about speed: a CSV file containing 98,500 records (around 13 MB) was split in about 2.31 seconds with this code. I'd say that's pretty good.

Very nice @Ryan Saxe!

If you're using the first script in Python 3.x, replace izip_longest with the new zip_longest.

@YuvalPruss I updated it per your comment, now that Py3 is standard.

You can speed it up by commenting out print(i, file_name).

# Another approach: stream through the big file, starting a new output file
# every lines_per_file lines.
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

# CSV variant: split large_file.csv into files of MAX_CHUNKS rows each.
import csv
import os
import re

MAX_CHUNKS = 300  # rows per output file


def writeRow(idr, row):
    # Append one row to the chunk file with index idr (Python 3: text mode, newline='').
    with open("file_%d.csv" % idr, 'a', newline='') as f_out:
        writer = csv.writer(f_out, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # Remove chunk files left over from a previous run.
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        idr = 1
        for i, x in enumerate(r):
            if i and i % MAX_CHUNKS == 0:
                idr += 1  # start a new chunk file every MAX_CHUNKS rows
            writeRow(idr, x)

if __name__ == "__main__":
    main()