File processing in Python using multiprocessing
I am a Python beginner trying to add a few lines of code to convert JSON to CSV and back to JSON. There are thousands of files (about 300 MB each) to convert and process. With the current program (using 1 CPU), I can't make use of the server's 16 CPUs and need advice on fine-tuning the program for multiprocessing. Below is my Python 3.7 code:
import json
import csv
import os

os.chdir('/stagingData/Scripts/test')

for JsonFile in os.listdir(os.getcwd()):
    PartialFileName = JsonFile.split('.')[0]

    j = 1
    with open(PartialFileName + ".csv", 'w', newline='') as Output_File:
        with open(JsonFile) as fileHandle:
            i = 1
            for Line in fileHandle:
                try:
                    data = json.loads(Line, parse_float=str)
                except json.JSONDecodeError:
                    print("Can't load line {}".format(i))
                    continue  # skip unparseable lines instead of reusing stale data
                if i == 1:
                    header = data.keys()
                    output = csv.writer(Output_File)
                    output.writerow(header)  # writes header row
                    i += 1
                output.writerow(data.values())  # writes values row
                j += 1
Appreciate any suggestions on the multiprocessing logic.

Since you have many files, the simplest multiprocessing example from the docs should suit you. You could also try replacing listdir with an iterator such as os.scandir, so that the full list of directory entries doesn't have to be built before you start. If instead you have one large file to process more efficiently, I suggest the following:
import csv
import json
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

source_big_file = Path('/path/to/file')

def chunk_file_by_line(source_filepath: Path, chunk_line_size: int = 10_000):
    intermediate_file_handlers = {}
    last_chunk_filepath = None
    with source_filepath.open('r', encoding='utf8') as big:
        for line_number, line in enumerate(big):
            group = line_number - (line_number % chunk_line_size)
            chunk_filename = f'{source_filepath.stem}.g{group}{source_filepath.suffix}'
            chunk_filepath = source_filepath.parent / chunk_filename
            if chunk_filepath not in intermediate_file_handlers:
                file_handler = chunk_filepath.open('w', encoding='utf8')
                intermediate_file_handlers[chunk_filepath] = file_handler
                if last_chunk_filepath:
                    # the previous chunk is complete: close it and hand it out
                    last_file_handler = intermediate_file_handlers[last_chunk_filepath]
                    last_file_handler.close()
                    yield last_chunk_filepath
            else:
                file_handler = intermediate_file_handlers[chunk_filepath]
            file_handler.write(line)
            last_chunk_filepath = chunk_filepath
    # output the last one
    if last_chunk_filepath:
        intermediate_file_handlers[last_chunk_filepath].close()
        yield last_chunk_filepath

def json_to_csv(json_filepath: Path) -> Path:
    csv_filename = f'{json_filepath.stem}.csv'
    csv_filepath = json_filepath.parent / csv_filename
    with csv_filepath.open('w', encoding='utf8', newline='') as csv_out, \
            json_filepath.open('r', encoding='utf8') as json_in:
        dwriter = None
        for json_line in json_in:
            data = json.loads(json_line)
            if dwriter is None:
                # create the writer and header record from the first line's keys
                dwriter = csv.DictWriter(csv_out, fieldnames=list(data.keys()))
                dwriter.writeheader()
            dwriter.writerow(data)
    return csv_filepath

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        futures = []
        for chunk_filepath in chunk_file_by_line(source_big_file):
            future = pool.submit(json_to_csv, chunk_filepath)
            futures.append(future)
        # wait for all to finish
        for future in futures:
            csv_filepath = future.result(timeout=None)  # waits until complete
            print(f'conversion complete> csv filepath: {csv_filepath}')
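For the many-files case the question actually describes, a minimal per-file sketch could look like the following. The convert_one helper is an illustrative stand-in built from the question's own conversion loop, and the source directory path is the one from the question:

```python
import csv
import json
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

def convert_one(json_filepath: Path) -> Path:
    """Convert one newline-delimited JSON file to CSV; return the CSV path."""
    csv_filepath = json_filepath.with_suffix('.csv')
    with json_filepath.open(encoding='utf8') as json_in, \
            csv_filepath.open('w', encoding='utf8', newline='') as csv_out:
        writer = None
        for line in json_in:
            data = json.loads(line, parse_float=str)
            if writer is None:
                # header comes from the first record's keys
                writer = csv.DictWriter(csv_out, fieldnames=list(data))
                writer.writeheader()
            writer.writerow(data)
    return csv_filepath

if __name__ == '__main__':
    source_dir = Path('/stagingData/Scripts/test')
    json_files = [p for p in source_dir.iterdir() if p.suffix == '.json']
    # one task per file; the executor spreads tasks over all available CPUs
    with ProcessPoolExecutor() as pool:
        for csv_path in pool.map(convert_one, json_files):
            print(f'done: {csv_path}')
```

Because each file is an independent task, no coordination between workers is needed; ProcessPoolExecutor defaults to one worker per CPU, which matches the 16-CPU server.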
I suggest reading up on threading in Python. Python 3 also has real coroutine support, which is like lighter-weight threading. I'm back on Python 2.7, so I don't know what cool new things are in Python 3. I think simple threading would be simpler either way, because you don't want too many threads running at once (certainly not one thread for every file simultaneously); you want one worker per file at a time. Define a function that takes a filename/file object and does what you do here, then use Pool.map or similar. I don't know whether the json module releases the GIL while parsing, so try ThreadPoolExecutor first, and if you don't see multiple cores being used, change it to ProcessPoolExecutor.
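A sketch of that suggestion, with the executor class as the only switch. The parse_lines function is an illustrative stand-in for the per-file work, not code from the question:

```python
import json
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def parse_lines(filepath):
    """Stand-in for the per-file work: parse every JSON line, count records."""
    count = 0
    with open(filepath, encoding='utf8') as fh:
        for line in fh:
            json.loads(line)
            count += 1
    return count

def run_all(filepaths, use_processes=False):
    # Start with threads; if all 16 cores don't light up (the json module
    # holding the GIL while parsing), flip use_processes to True.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=16) as pool:
        return list(pool.map(parse_lines, filepaths))
```

Keeping the worker function identical for both executors makes the thread-vs-process comparison a one-line change.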
I tried using pandas DataFrames, but the performance was very poor. For example, execution took 5 minutes, whereas the current code finishes in 15 seconds.

I added a few lines for multiprocessing, but was disappointed with the result: I got 180 seconds with a single CPU and 150 seconds with 16 CPUs. Is there a better way? The file conversion is a separate function, and the arguments are passed in here:

if __name__ == '__main__':
    worker_count = 16
    processes = []
    for i in range(worker_count):
        p = multiprocessing.Process(target=fileConversion, args=('/stagingData/Scripts/test',))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()

That is probably because most of the time is spent reading from and writing to storage; you would only see a significant speedup if the conversion task were more CPU-heavy.

I watched CPU usage while the program was running, and all 16 CPUs showed over 95% usage, so the program is CPU-intensive. Thread limits may add a bit of complexity to handling multiple threads. Any other suggestions for processing one file per processor? Since these files will be processed further downstream, I want to avoid errors. Appreciate your feedback, thanks.

Do you want to merge the results back into a single file?

No. Each file will be processed and converted separately. With thousands of files, I want to distribute the load across the 16 CPUs. I tried several runs: without multiprocessing, the 11 files (3 GB) convert in 85-90 seconds; with multiprocessing it takes more time, finishing in 150 seconds. Not sure what is happening?
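One thing worth checking in that snippet: every multiprocessing.Process receives the same directory argument, so unless fileConversion partitions the listing internally, all 16 workers may each convert every file, which would explain the lack of speedup. A sketch that splits the file list across workers instead; split_round_robin, file_conversion, and convert_all are illustrative names, not from the original code:

```python
import multiprocessing
from pathlib import Path

def split_round_robin(items, worker_count):
    """Deal items out round-robin: one slice of the file list per worker."""
    return [items[i::worker_count] for i in range(worker_count)]

def file_conversion(file_list):
    """Placeholder for the real per-file JSON -> CSV conversion."""
    for path in file_list:
        pass  # conversion work goes here

def convert_all(source_dir, worker_count=16):
    files = sorted(Path(source_dir).glob('*.json'))
    processes = []
    # each worker gets its own slice, so no file is converted twice
    for file_slice in split_round_robin(files, worker_count):
        p = multiprocessing.Process(target=file_conversion, args=(file_slice,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
```

A multiprocessing.Pool with pool.map over the file list achieves the same division with less bookkeeping, but the version above stays close to the Process-based snippet in the comment.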