Reading different sets of JSON files concurrently with Python
I have two sets of files, b and c (JSON). Each set usually contains 500-1000 files. At the moment I read the two sets one after the other. Can I use multithreading to read both sets at the same time? I have enough memory and processors.
import json
import pandas as pd

yc = ...  # number of c files
yb = ...  # number of b files

c_output_transaction_list = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    print(c_json_file)
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    c_output_transaction_list.extend(c_transaction_list)
df_res_c = pd.DataFrame(c_output_transaction_list)

b_output_transaction_list = []
for num in range(yb):
    b_json_file = './output/d_b_' + str(num) + '.json'
    print(b_json_file)
    b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
    b_output_transaction_list.extend(b_transaction_list)
df_res_b = pd.DataFrame(b_output_transaction_list)
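For reference, the threaded version I am asking about might look like the sketch below. It uses `concurrent.futures.ThreadPoolExecutor` (a reasonable fit since reading files is I/O-bound); the toy JSON files it writes to a temporary directory are only there so the sketch runs standalone, and the helper name `load_transactions` is mine, not from my real code:

    # Sketch: read both JSON sets concurrently with threads.
    # The toy files written below stand in for the real
    # ./output/d_c_*.json and ./output/d_b_*.json sets.
    import json
    import os
    import tempfile
    from concurrent.futures import ThreadPoolExecutor

    tmp = tempfile.mkdtemp()
    for prefix in ("c", "b"):
        for num in range(3):  # toy stand-in for yc / yb
            path = os.path.join(tmp, f"d_{prefix}_{num}.json")
            with open(path, "w") as f:
                json.dump({"data": {"transaction_list": [{"id": num, "set": prefix}]}}, f)

    def load_transactions(path):
        # One file -> its list of transaction dicts.
        with open(path) as f:
            return json.load(f)["data"]["transaction_list"]

    c_files = [os.path.join(tmp, f"d_c_{n}.json") for n in range(3)]
    b_files = [os.path.join(tmp, f"d_b_{n}.json") for n in range(3)]

    with ThreadPoolExecutor(max_workers=8) as ex:
        # ex.map preserves input order, so the rows come back
        # in the same order as the sequential loops.
        c_rows = [row for rows in ex.map(load_transactions, c_files) for row in rows]
        b_rows = [row for rows in ex.map(load_transactions, b_files) for row in rows]

The flattened `c_rows` / `b_rows` lists can then be passed to `pd.DataFrame(...)` exactly as in the sequential version.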
I use this approach to read hundreds of files in parallel into a final dataframe. Without your data I can't verify it meets your requirements, so you will have to check that yourself; reading the multiprocessing docs will help. I run the same code on Linux (an AWS EC2 instance reading S3 files) and on Windows reading the same S3 files, and I found it saves a lot of time.
import os
import json
import pandas as pd
from multiprocessing import Pool

# You can set the number of processes yourself or just take cpu_count
# from the os module. Playing around with this does make a difference;
# for me, using the max isn't always the fastest overall time.
num_proc = os.cpu_count()

# Define the function that creates a dataframe from one file.
# Note this is different from your version: here each file becomes
# its own dataframe, and they are concatenated at the end.
def json_parse(c_json_file):
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    return pd.DataFrame(c_transaction_list)

# This multiprocessing function feeds the file names to the parsing
# function. If you don't pass num_proc it defaults to 4.
def json_multiprocess(fn_list, num_proc=4):
    with Pool(num_proc) as pool:
        # map is right here because json_parse takes a single argument;
        # if you need to pass more than the file name, switch to
        # starmap, which handles zip() very well. 15 is the chunksize.
        r = pool.map(json_parse, fn_list, 15)
    return r

# Build your file list first
yc = ...  # number of c files
flist = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    flist.append(c_json_file)

# Get a list of your intermediate dataframes
dfs = json_multiprocess(flist, num_proc=num_proc)

# Concat your dataframes
df_res_c = pd.concat(dfs)
Then do the same for the next set of files...
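One detail worth noting when concatenating the per-file frames: `pd.concat(dfs)` keeps each frame's own 0..n index, so the result has duplicate index labels. Passing `ignore_index=True` gives a clean running index instead. A minimal sketch with two toy frames standing in for the intermediate dataframes:

    import pandas as pd

    # Two toy per-file frames, each with its own 0-based index.
    dfs = [pd.DataFrame([{"id": 0}]), pd.DataFrame([{"id": 1}])]

    # ignore_index=True renumbers the rows 0..n-1 instead of
    # keeping the duplicated per-frame indices.
    df_res_b = pd.concat(dfs, ignore_index=True)

Without `ignore_index=True`, both rows here would carry index label 0.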
Comments: Using the example from Aelarion's comment helped with building the file list. Note also the caution that adding parallelism to I/O-bound processing can just make it slower, and the comment about Linux vs. Windows performance.