Multiprocessing issue when writing to CSV in Python

Tags: python, multithreading, multiprocessing

I wrote a function that takes a date as an argument and appends its output to a CSV file. If I run it through a multiprocessing pool with 28 processes over a list of 100 dates, the last 72 rows of the output CSV are twice as long as they should be (each of those rows is simply its own contents repeated side by side).

My code:

import numpy as np
import pandas as pd
import multiprocessing

#Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()
def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output

    return sample

# list_s is a list of dates I want to calculate function funk for   

def mp_handler():
    # 28 is the number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')


if __name__=='__main__':
    mp_handler()
The output looks like this:

date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315

...
# and after the first 28 rows, like this:
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
I tried inserting a lock() into funk(), but the result was the same; it only took more time. Is there a way to fix this?

EDIT

funk looks like this (e stands for a date):

def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        #merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0

        # calculate expected return
        for i in range(0,20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)]*sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x<0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret']*sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1,6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)]*sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)]*(1+sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0,20):
            sample.drop(columns = ['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)]*sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block
Comments on the question:

Why use multiprocessing to write to a file at all? I'm fairly sure you will get data races in every case, since a plain CSV doesn't support that kind of thing, afaik.

@aws_apprentice That was my first thought too, but the OP only writes to the file from the main process (unless funk also writes something), so that's not it. I bet there is a scoping problem in funk and that df grows from one function call to the next.

I tried appending these dataframes to a global dataframe and then writing that one to CSV, but at some point the number of columns in the global df no longer matches the result. The global (updated) dataframe looks like the first part of the sample file given above, idk why, while the result dataframe looks like the second part of the output CSV file (with the doubled columns). @swenzel I will check my function and edit the question with more detailed code.

Answer:

Take an explicit copy when slicing the shared dataframe inside funk, i.e. sample = df[df['permno'].isin(s)].copy(), so that the later column assignments cannot grow the module-level df that lives on inside each pool worker. Also, as @stovfl pointed out, you'd better use result.to_csv('crsp_full.csv', mode='a', header=False) for every result except the first, so the header line is written only once.