Python multiprocessing problem when writing to csv
I created a function that takes a date as an argument and writes the output it produces to a csv. If I run a multiprocessing pool with 28 tasks and I have a list of 100 dates, the last 72 rows in the output csv file are twice as long as they should be (the last 72 rows are just joined duplicates of themselves). My code:
import numpy as np
import pandas as pd
import multiprocessing

# Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is the number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
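For reference, the same pattern can be sketched in a minimal, self-contained form (`make_row` stands in for `funk`; the file name is arbitrary). `Pool.imap` yields results in input order, and only the parent process touches the file, so the loop itself is race-free; passing `header=(i == 0)` keeps the header from being rewritten on every append:

```python
import multiprocessing
import pandas as pd

def make_row(date):
    # Stand-in for funk(): build a one-row DataFrame for a single date.
    return pd.DataFrame({'date': [date], 'port_ret_1': [0.0]}).set_index('date')

def mp_handler(dates, path):
    with multiprocessing.Pool(4) as p:
        for i, result in enumerate(p.imap(make_row, dates)):
            # Only the parent process writes; imap preserves input order.
            # Write the header once, on the first chunk only.
            result.to_csv(path, mode='a', header=(i == 0))

if __name__ == '__main__':
    mp_handler(['2010-03-05', '2010-03-08'], 'out.csv')
```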
The output looks like this:
date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315
...
# and after the first 28 rows, like this:
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
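The repeated header rows in the first part of that output follow directly from calling `to_csv` with `mode='a'` and the default `header=True` for every result: each append writes its own header line. A tiny demonstration with made-up data:

```python
import io
import pandas as pd

# Appending two chunks with the default header=True repeats the header
# line once per chunk, which matches the first part of the output above.
a = pd.DataFrame({'date': ['2010-03-05'], 'ret': [0.002]})
b = pd.DataFrame({'date': ['2010-02-12'], 'ret': [0.009]})

buf = io.StringIO()
a.to_csv(buf, index=False)
b.to_csv(buf, index=False)   # header line is written a second time

print(buf.getvalue())
```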
I tried inserting a lock() into funk(), but the result was the same; it only took more time. Is there a way to fix this?

Edit: funk looks like this (e is equivalent to date):
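As an aside, a lock created inside `funk` would be a fresh, unshared object in each worker, so it cannot serialise anything. To actually share one lock across pool workers it has to be handed over through the pool's `initializer`; a minimal sketch under that assumption (`init_worker` and `work` are illustrative names, not from the original code):

```python
import multiprocessing

_lock = None

def init_worker(lock):
    # Runs once in every worker process; stores the shared lock globally.
    global _lock
    _lock = lock

def work(n):
    with _lock:          # serialises the critical section across workers
        return n * n

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(2, initializer=init_worker,
                              initargs=(lock,)) as p:
        print(sorted(p.map(work, range(5))))   # [0, 1, 4, 9, 16]
```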
def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print(ran0)
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window
        # and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print(sample.index.unique())
        # get past 20 days' returns in 20 additional columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        # merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0
        # calculate expected return
        for i in range(0, 20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)] * sample['r_{}'.format(i)]
        # print(sample)
        # assign each stock to one of two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x < 0 else 1)
        # workaround for the short leg: multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret'] * sample['sign']
        # create 5 columns for the future realised 5-day returns
        # (multiplied by -1 for the short leg)
        for i in range(1, 6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)] * sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)] * (1 + sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0, 20):
            # drop() returns a new frame; the result must be assigned back
            # (the original code discarded it, a no-op)
            sample = sample.drop(columns=['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        list_names = []   # collected locally (a global in the original code)
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)] * sample['rp_{}'.format(k)]
            list_names.append('port_ret_{}'.format(k))
        block = sample.groupby('date')[list_names].sum().copy()
    return block
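One bug worth calling out in the original version of this function: `DataFrame.drop` returns a new frame, so a bare `sample.drop(columns=[...])` inside the loop is a no-op unless the result is assigned back (or `inplace=True` is passed), which leaves all the intermediate columns in the output. A minimal demonstration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})

# drop() returns a new DataFrame; without assignment (or inplace=True)
# the original frame keeps all of its columns.
df.drop(columns=['b'])
assert list(df.columns) == ['a', 'b']   # 'b' is still there

df = df.drop(columns=['b'])             # assign the result back instead
assert list(df.columns) == ['a']
```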
Comments:

— Why use multiprocessing to write to a file at all? Pretty sure there will be a data race in every case, since a plain CSV doesn't support anything like that, afaik.
— @aws_apprentice That was my first thought too, but the OP actually only writes to the file from the main process (unless funk writes something as well), so that's not it. I'd bet there is a scoping problem in funk, with df growing from one call to the next.
— I tried appending these dataframes to a global dataframe and then writing that to csv, but at some point the number of columns of the global df differs from the result dataframe. The global (updated) dataframe looks like the first part of the sample file given above, idk why; the result dataframe looks like the second part of the output csv file (doubled number of columns). @swenzel I will check my function and edit the question with more detailed code.

Answer:

Take an explicit copy when building the sample: sample = df[df['permno'].isin(s)].copy(). And, as @stovfl pointed out, you'd better use result.to_csv('crsp_full.csv', mode='a', header=False) for every write except the first.
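The .copy() part of the answer can be illustrated in isolation: taking an explicit copy of the boolean-indexed slice before adding columns keeps each worker's mutations away from the shared df and avoids chained-assignment surprises (the data below is made up):

```python
import pandas as pd

df = pd.DataFrame({'permno': ['10001', '93422'], 'ret': [0.01, 0.02]})
s = ['10001']

# An explicit copy decouples the sample from the shared frame, so new
# columns added by a worker never leak back into df.
sample = df[df['permno'].isin(s)].copy()
sample['r_0'] = sample['ret']

assert 'r_0' not in df.columns        # the original frame is untouched
assert list(sample['r_0']) == [0.01]
```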