Python: how to chunk a large file by size and by a condition


I have a large text file that I split into smaller files of a fixed size. Here is the code I have so far:

import math
import os

numThread = 4
inputData = r'dir\example.txt'  # raw string so the backslash is kept literally

def chunk_files():
    # Count the lines first so the chunk size can be derived from the thread count.
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as file_:
        nline = sum(1 for line in file_)
    n_thread = int(numThread)
    chunk_size = math.floor(nline / n_thread)
    j = 0
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as file_:
        for i, line in enumerate(file_):
            # Close the current chunk when a chunk boundary is reached.
            if (i + 1 == j * chunk_size and j != n_thread) or i == nline:
                out.close()
            # Open the next chunk file on the first line and at every boundary.
            if i + 1 == 1 or (j != n_thread and i + 1 == j * chunk_size):
                chunk_file = '_raw' + str(j) + '.txt'
                if os.path.isfile(chunk_file):
                    break
                out = open(chunk_file, 'w+', encoding='utf-8', errors='ignore')
                j = j + 1
            if not out.closed:
                out.write(line)
            if i % 1000 == 0 and i != 0:
                print('Processing line %i ...' % i)
    print('Done.')
Here is a sample of the text in the file:

190219 7:05:30 line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec
200219 9:10:10 line2 success 
               line2 this is the 1st success process
Because the split is driven only by the chunk size, a record can be cut in the middle, like this:

190219 7:05:30 line3 success 
               line3 this is the 1st success process

               line3 this process need 3sec
200219 9:10:10 line2 success 
               line2 this is the 1st success process

I need to split on the regex reg = re.compile(r"\b(\d{6})(?=\s\d{1,}:\d{2}:\d{2})\b"), which matches the six-digit date that is followed by a time, so that the result looks like this:

190219 7:05:30 line3 success 
               line3 this is the 1st success process
               line3 this process need 3sec

200219 9:10:10 line2 success 
               line2 this is the 1st success process

I have tried, but I can't seem to adapt it to my problem.


Can anyone help me put the regex into the chunk_files function? Thanks in advance.

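Since the number of lines per record doesn't seem to be fixed, we could capture the six-digit number and the time, collect all the remaining lines, and then script the rest of the problem. This simple expression might be a starting point: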

(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*
This part matches the digits:

(\d{6})\s(\d{1,}:\d{2}:\d{2})
and this part matches the remaining lines:

\s*(.*)\s*
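As a rough sketch of how the two alternatives of this expression can be told apart in Python: a match with group 1 set is a date/time stamp, and a match with group 3 set is ordinary line text. The names `PATTERN`, `sample`, and `scan` are only illustrative and are not part of the original answer.

```python
import re

# The expression from the answer: groups 1 and 2 capture the date and time,
# group 3 captures any other run of text on a line.
PATTERN = re.compile(r"(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*")

# The sample text from the question.
sample = (
    "190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process"
)


def scan(text):
    """Return a list of ("stamp", date, time) and ("text", content) tuples."""
    items = []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            # First alternative matched: a six-digit date followed by a time.
            items.append(("stamp", m.group(1), m.group(2)))
        elif m.group(3):
            # Second alternative matched non-empty text; empty matches are skipped.
            items.append(("text", m.group(3).rstrip()))
    return items


for item in scan(sample):
    print(item)
```

Every `"stamp"` tuple marks the start of a new record, so the tuples that follow it up to the next stamp belong to the same record.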
I believe keeping things simple would help a lot here.

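import re

# l holds the full text to split
all_parts = []
part = []
for line in l.split('\n'):
    # A line that starts with a date and a time begins a new part.
    if re.search(r"^\d+\s\d+:\d+:\d+\s", line):
        if part:
            all_parts.append(part)
            part = []
    part.append(line)
else:
    # The loop finished: store the last part as well.
    all_parts.append(part)
print(all_parts)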
Trying it with your test data gives the following:

In [37]: all_parts                                                                                                                                                                                
Out[37]: 
[['190219 7:05:30 line3 success ',
  '               line3 this is the 1st success process',
  '               line3 this process need 3sec'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process'],
 ['190219 7:05:30 line3 success ',
  '               line3 this is the 1st success process',
  '               line3 this process need 3sec'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process']]
You can then have the code return a generator/iterator, which lets you chunk a file of any size and get each chunk as a list of lines.
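A minimal sketch of that generator idea, assuming every record starts with a line of the form `190219 7:05:30 ...`. The names `RECORD_START`, `iter_records`, `chunk_file`, `prefix`, and `out_dir` are hypothetical and chosen for illustration. Unlike the code in the question, this version does not stop when a chunk file already exists; it overwrites it.

```python
import os
import re

# A record starts with a six-digit date, whitespace, and an h:mm:ss time.
RECORD_START = re.compile(r"^\d{6}\s\d{1,2}:\d{2}:\d{2}\s")


def iter_records(lines):
    """Yield one list of lines per record, reading the input lazily."""
    record = []
    for line in lines:
        if RECORD_START.match(line) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def chunk_file(path, n_chunks, prefix="_raw", out_dir="."):
    """Write up to n_chunks files, switching files only between records."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        n_lines = sum(1 for _ in f)
    target = max(1, n_lines // n_chunks)  # approximate lines per chunk
    written = []
    out = None
    lines_in_chunk = 0
    try:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for record in iter_records(f):
                chunk_full = lines_in_chunk >= target and len(written) < n_chunks
                if out is None or chunk_full:
                    if out is not None:
                        out.close()
                    name = os.path.join(out_dir, f"{prefix}{len(written)}.txt")
                    out = open(name, "w", encoding="utf-8")
                    written.append(name)
                    lines_in_chunk = 0
                out.writelines(record)
                lines_in_chunk += len(record)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `chunk_file(r'dir\example.txt', 4)` would then produce `_raw0.txt` to `_raw3.txt` at most. Chunk sizes are only approximate, because a chunk is allowed to grow past the target until the current record ends.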

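For reference, testing the expression (\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s* against a longer sample produces these matches:

Match 1 was found at 0-14: 190219 7:05:30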
Group 1 found at 0-6: 190219
Group 2 found at 7-14: 7:05:30
Group 3 found at -1--1: None
Match 2 was found at 14-45:  line3 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 15-29: line3 success 
Match 3 was found at 45-98: line3 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 45-82: line3 this is the 1st success process
Match 4 was found at 98-127: line3 this process need 3sec

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 98-126: line3 this process need 3sec
Match 5 was found at 127-141: 200219 9:10:10
Group 1 found at 127-133: 200219
Group 2 found at 134-141: 9:10:10
Group 3 found at -1--1: None
Match 6 was found at 141-172:  line2 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 142-156: line2 success 
Match 7 was found at 172-210: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 172-209: line2 this is the 1st success process
Match 8 was found at 210-224: 190219 7:05:30
Group 1 found at 210-216: 190219
Group 2 found at 217-224: 7:05:30
Group 3 found at -1--1: None
Match 9 was found at 224-255:  line3 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 225-239: line3 success 
Match 10 was found at 255-308: line3 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 255-292: line3 this is the 1st success process
Match 11 was found at 308-337: line3 this process need 3sec

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 308-336: line3 this process need 3sec
Match 12 was found at 337-351: 200219 9:10:10
Group 1 found at 337-343: 200219
Group 2 found at 344-351: 9:10:10
Group 3 found at -1--1: None
Match 13 was found at 351-382:  line2 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 352-366: line2 success 
Match 14 was found at 382-420: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 382-419: line2 this is the 1st success process
Match 15 was found at 420-434: 200219 9:10:10
Group 1 found at 420-426: 200219
Group 2 found at 427-434: 9:10:10
Group 3 found at -1--1: None
Match 16 was found at 434-465:  line2 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 435-449: line2 success 
Match 17 was found at 465-518: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 465-502: line2 this is the 1st success process
Match 18 was found at 518-571: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 518-555: line2 this is the 1st success process
Match 19 was found at 571-624: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 571-608: line2 this is the 1st success process
Match 20 was found at 624-677: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 624-661: line2 this is the 1st success process
Match 21 was found at 677-730: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 677-714: line2 this is the 1st success process
Match 22 was found at 730-767: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 730-767: line2 this is the 1st success process
Match 23 was found at 767-767: 
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 767-767: