File processing using multiprocessing - Python


I am a Python beginner trying to add a few lines of code to convert JSON to CSV and back to JSON. There are thousands of files (300 MB in size) to convert and process. With the current program (which uses 1 CPU) I cannot make use of the server's 16 CPUs, and I need advice on fine-tuning the program for multiprocessing. Below is my Python 3.7 code.

import json
import csv
import os

os.chdir('/stagingData/Scripts/test')

for JsonFile in os.listdir(os.getcwd()):
    PartialFileName = JsonFile.split('.')[0]

    j = 1
    with open(PartialFileName +".csv", 'w', newline='') as Output_File:

        with open(JsonFile) as fileHandle:
            i = 1
            for Line in fileHandle:
                try:
                    data = json.loads(Line, parse_float=str)
                except:
                    print("Can't load line {}".format(i))
                if i == 1:
                    header = data.keys()
                    output = csv.writer(Output_File)
                    output.writerow(header) #Writes header row
                i += 1
                output.writerow(data.values()) #writes values row
        j += 1

Any advice on the multiprocessing logic is appreciated.

Since you have many files, the simplest multiprocessing example from the documentation should suit you.


You could also try replacing listdir with os.scandir, which does not have to return all of the directory entries before it starts.
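
To make that concrete, here is a minimal sketch of the per-file pool approach, reusing the directory and conversion logic from the question; the convert_one_file helper and the choice of imap_unordered are illustrative, not part of the original answer:

    import csv
    import json
    import os
    from multiprocessing import Pool

    SOURCE_DIR = '/stagingData/Scripts/test'  # directory from the question

    def convert_one_file(json_path):
        # the question's per-file logic, wrapped so a pool worker can run it for one file
        csv_path = os.path.splitext(json_path)[0] + '.csv'
        with open(json_path) as json_in, open(csv_path, 'w', newline='') as csv_out:
            writer = csv.writer(csv_out)
            for i, line in enumerate(json_in):
                data = json.loads(line, parse_float=str)
                if i == 0:
                    writer.writerow(data.keys())  # header row from the first record
                writer.writerow(data.values())
        return csv_path

    if __name__ == '__main__':
        # scandir yields entries lazily instead of building the full listing first
        with os.scandir(SOURCE_DIR) as entries:
            paths = (entry.path for entry in entries if entry.is_file())
            with Pool() as pool:  # defaults to one worker per CPU (16 here)
                for done in pool.imap_unordered(convert_one_file, paths):
                    print(f'converted: {done}')

imap_unordered starts handing files to workers as soon as scandir produces them and reports results in completion order.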

If you have one large file that you want to process more efficiently, I suggest the following:

  • Split the file into chunks

  • Create a process to handle each chunk

  • (if necessary) merge the processed chunks back into a single file (a sketch of this step follows the example below)

  • Something like this:

    import csv
    import json
    from pathlib import Path
    from concurrent.futures import ProcessPoolExecutor
    
    source_big_file = Path('/path/to/file')
    
    def chunk_file_by_line(source_filepath: Path, chunk_line_size: int = 10_000):
        intermediate_file_handlers = {}
        last_chunk_filepath = None
        with source_filepath.open('r', encoding='utf8') as big:
            for line_number, line in enumerate(big):
                # lines 0..9_999 go to group 0, the next 10_000 lines to group 10_000, etc.
                group = line_number - (line_number % chunk_line_size)
                chunk_filename = f'{source_filepath.stem}.g{group}{source_filepath.suffix}'
                chunk_filepath = source_filepath.parent / chunk_filename
                if chunk_filepath not in intermediate_file_handlers:
                    file_handler = chunk_filepath.open('w', encoding='utf8')
                    intermediate_file_handlers[chunk_filepath] = file_handler
                    if last_chunk_filepath:
                        # the previous chunk is complete: close it and hand it to the caller
                        last_file_handler = intermediate_file_handlers[last_chunk_filepath]
                        last_file_handler.close()
                        yield last_chunk_filepath
                else:
                    file_handler = intermediate_file_handlers[chunk_filepath]
                file_handler.write(line)
                last_chunk_filepath = chunk_filepath
        # output the last chunk
        if last_chunk_filepath:
            intermediate_file_handlers[last_chunk_filepath].close()
            yield last_chunk_filepath
    
    
    def json_to_csv(json_filepath: Path) -> Path:
        csv_filename = f'{json_filepath.stem}.csv'
        csv_filepath = json_filepath.parent / csv_filename
        with csv_filepath.open('w', encoding='utf8', newline='') as csv_out, json_filepath.open('r', encoding='utf8') as json_in:
            dwriter = None
            for json_line in json_in:
                data = json.loads(json_line)
                if dwriter is None:
                    # build the writer (and header row) from the keys of the first record
                    dwriter = csv.DictWriter(csv_out, fieldnames=list(data.keys()))
                    dwriter.writeheader()
                dwriter.writerow(data)
        return csv_filepath
    
    
    with ProcessPoolExecutor() as pool:
        futures = []
        for chunk_filepath in chunk_file_by_line(source_big_file):
            future = pool.submit(json_to_csv, chunk_filepath)
            futures.append(future)
    
        # wait for all to finish
        for future in futures:
            csv_filepath = future.result(timeout=None)  # waits until complete
            print(f'conversion complete> csv filepath: {csv_filepath}')
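
The example stops at one CSV per chunk. If you do need the optional merge step from the list above, a minimal sketch could look like the following; it assumes every chunk CSV shares the same header row, and merge_csv_chunks plus the merged.csv name are placeholders rather than part of the answer above:

    from pathlib import Path

    def merge_csv_chunks(chunk_csv_paths, merged_filepath: Path) -> Path:
        # concatenate the chunk CSVs, keeping only the first file's header row
        with merged_filepath.open('w', encoding='utf8', newline='') as merged:
            for index, chunk_csv in enumerate(chunk_csv_paths):
                with chunk_csv.open('r', encoding='utf8') as chunk:
                    header = chunk.readline()
                    if index == 0:
                        merged.write(header)
                    merged.writelines(chunk)  # remaining data rows
        return merged_filepath

    # usage (placeholder paths):
    # merge_csv_chunks(sorted(Path('/path/to').glob('*.g*.csv')), Path('/path/to/merged.csv'))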
    

I suggest reading up on threading in Python. Or, Python 3 has actual coroutines, which are like lighter-weight threads; I am back on Python 2.7, so I don't know what cool new things Python 3 has. I think simple threading would be easier either way, because you don't want too many threads running at once (certainly not one per file); you want each worker thread to process one file at a time. Define a function that takes a filename/file object and does what you do here, then use Pool.map or something similar. I don't know whether the json module releases the GIL while parsing, so try ThreadPoolExecutor, and if you don't see multiple cores being used, change it to ProcessPoolExecutor (see the sketch after this comment thread).

I tried using a pandas DataFrame, but the performance was very poor; for example, it takes 5 minutes to execute, whereas the current code finishes in 15 seconds. I added a few lines for multiprocessing but was disappointed with the result: 180 seconds with a single CPU versus 150 seconds with 16 CPUs. Is there a better way? The file conversion is a separate function, and the arguments are passed in here:

    if __name__ == '__main__':
        worker_count = 16
        processes = []
        for i in range(worker_count):
            p = multiprocessing.Process(target=fileConversion, args=('/stagingData/Scripts/test'))
            processes.append(p)
            p.start()
        for process in processes:
            process.join()

That is probably because most of the time is spent reading from and writing to storage; you will only see a significant speedup if the conversion work itself is CPU-heavy.

I watched the CPU usage while the program was running, and all 16 CPUs showed more than 95% utilization, so the program is CPU-bound. A thread limit might add a bit of complexity to handling multiple threads. Any other suggestions for one file per processor? Since these files will be processed further downstream, I want to avoid errors. Thanks for the feedback.

Do you want to merge them back into a single file?

No, each file will be processed and converted separately. With thousands of files, I want to spread the load across the 16 CPUs. I tried several runs: without multiprocessing, the 11 files (3 GB) are converted in 85-90 seconds; with multiprocessing it takes longer and finishes in 150 seconds. Not sure what is going on.