How can I reduce the execution time of my Python script?
I have a huge dataset of about 400 million records that I need to convert from rows to columns.

Input dataset:
+------+------------------------------------------------------------------+----------------------------------+--+
| HHID | VAL_CD64 | VAL_CD32 | |
+------+------------------------------------------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 373aeda34c0b4ab91a02ecf55af58e15 | |
| 203 | 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8 | 6f3606577eadacef1b956307558a1efd | |
| 203 | a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7 | 332321ab150879e930869c15b1d10c83 | |
| 720 | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | 2c4f97a04f02db5a36a85f48dab39b5b | |
| 720 | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | 2111293e946703652070968b224875c9 | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 5c80a555fcda02d028fc60afa29c4a40 | |
| 348 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 6c10cd11b805fa57d2ca36df91654576 | |
| 348 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 403 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | 71a91c4768bd314f3c9dc74e9c7937e8 | |
+------+------------------------------------------------------------------+----------------------------------+--+
Output dataset:
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
| HHID | VAL1_CD64 | VAL2_CD64 | VAL3_CD64 | VAL1_CD32 | VAL2_CD32 | VAL3_CD32 | |
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8 | a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7 | 373aeda34c0b4ab91a02ecf55af58e15 | 6f3606577eadacef1b956307558a1efd | 332321ab150879e930869c15b1d10c83 | |
| 720 | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | | 2c4f97a04f02db5a36a85f48dab39b5b | 2111293e946703652070968b224875c9 | | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 5c80a555fcda02d028fc60afa29c4a40 | 6c10cd11b805fa57d2ca36df91654576 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 403 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | | | 71a91c4768bd314f3c9dc74e9c7937e8 | | | |
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
My current Python code is:
import pandas as pd
import os
import shutil
import glob
import time

start = time.time()
print('\nFile Processing Started\n')

path = r'Sample Data'
input_file = r'test'
output_file = r'test_MOD'

# Read the pipe-separated input in chunks and collect them into a list.
chunk = pd.read_csv(input_file + '.psv', sep='|', chunksize=400000,
                    dtype={"HHID": "string", "VAL_CD64": "string", "VAL_CD32": "string"})
chunk_list = []
for c_no in chunk:
    chunk_list.append(c_no)

file_no = 1
rec_cnt = 0
for i in chunk_list:
    start2 = time.time()
    rec_cnt = rec_cnt + len(i)
    rec_cnt2 = len(i)
    df = pd.DataFrame(i)
    # Collect each household's values into lists, one row per HHID.
    df_ = df.groupby('HHID').agg({'VAL_CD64': list, 'VAL_CD32': list})
    # Expand each list column into numbered columns (VAL_CD64_1, VAL_CD64_2, ...).
    data = []
    for col in df_.columns:
        d = pd.DataFrame(df_[col].values.tolist(), index=df_.index)
        d.columns = [f'{col}_{n}' for n in map(str, range(1, len(d.columns) + 1))]
        data.append(d)
    res = pd.concat(data, axis=1)
    res.to_csv(output_file + str(file_no) + '.psv', index=True, sep='|')
    # Rewrite the chunk's output with a trailing pipe on every line.
    with open(output_file + str(file_no) + '.psv', 'r') as istr:
        with open(input_file + str(file_no) + '.psv', 'w') as ostr:
            for line in istr:
                line = line.strip('\n') + '|'
                print(line, file=ostr)
    os.remove(output_file + str(file_no) + '.psv')
    file_no += 1
    end2 = time.time()
    duration2 = end2 - start2
    print("\nProcessed " + str(rec_cnt2) + " records in " + str(round(duration2, 2))
          + " seconds. \nTotal Processed Records: " + str(rec_cnt))

os.remove(input_file + '.psv')

# Concatenate the per-chunk files into one, keeping only the first header.
allFiles = glob.glob(path + "/*.psv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open(os.path.join(path, 'someoutputfile.csv'), 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)

# Remove the intermediate per-chunk files and rename the merged output.
for item in os.listdir(path):
    if item.endswith(".psv"):
        os.remove(os.path.join(path, item))
final_file_name = input_file + '.psv'
os.rename(os.path.join(path, 'someoutputfile.csv'), final_file_name)

end = time.time()
duration = end - start
print("\n" + str(rec_cnt) + " records added in " + str(round(duration, 2)) + " seconds. \n")
However, this script takes an enormous amount of time: processing a single 400-million-record file took 16 hours. Is there any way to reduce the execution time and speed up the whole process?

Answer: Isn't pivot made for exactly this? i.e.
df1 = df.assign(cols = df.groupby('HHID').cumcount() + 1).\
    pivot_table(index='HHID', columns='cols', values=['VAL_CD64', 'VAL_CD32'],
                aggfunc=lambda x: x)
df1.columns = [i + '_' + str(j) for i, j in df1.columns]
df1.reset_index()
HHID VAL_CD32_1 VAL_CD32_2 VAL_CD32_3 VAL_CD64_1
0 203 373aeda34c0b4ab91a02ecf55af58e15 6f3606577eadacef1b956307558a1efd 332321ab150879e930869c15b1d10c83 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df...
1 348 5c80a555fcda02d028fc60afa29c4a40 6c10cd11b805fa57d2ca36df91654576 6040b29107adf1a41c4f5964e0ff6dcb 25c7cf022e6651394fa5876814a05b8e593d8c7f29846...
2 403 71a91c4768bd314f3c9dc74e9c7937e8 NaN NaN 3e8da3d63c51434bcd368d6829c7cee490170afc32b51...
3 720 2c4f97a04f02db5a36a85f48dab39b5b 2111293e946703652070968b224875c9 NaN f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e...
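A side note on that pivot: `aggfunc=lambda x: x` calls a Python function for every cell, which gets expensive at 400 million rows. A minimal sketch of the same reshape using set_index + unstack, which stays inside vectorized pandas code (the miniature frame and shortened hash values below are hypothetical stand-ins for the real data):

```python
import pandas as pd

# Hypothetical miniature of the input: real HHIDs with shortened hash values.
df = pd.DataFrame({
    'HHID': [203, 203, 720, 720, 403],
    'VAL_CD64': ['a64', 'b64', 'c64', 'd64', 'e64'],
    'VAL_CD32': ['a32', 'b32', 'c32', 'd32', 'e32'],
})

# Number the rows within each household, then move that number into the
# column axis with unstack -- no per-group Python lambda is ever called.
out = (df.assign(cols=df.groupby('HHID').cumcount() + 1)
         .set_index(['HHID', 'cols'])
         .unstack('cols'))
out.columns = [f'{name}_{num}' for name, num in out.columns]
out = out.reset_index()
```

Households with fewer values than the widest one simply get NaN in the extra columns, matching the blanks in the desired output above.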
Unless you have very little RAM, you can increase chunksize to 10000, 100000, 500000 or even more; keep raising it until you run into a memory error.
Since the runtime depends on the number of chunks, increasing that value should improve it. You could also write the processing logic in a C module (link below) and call it from your Python code, giving you C speed with Python's simplicity.
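Related to chunking: the script above first appends every chunk to chunk_list, which pulls the entire file into memory before any processing starts. A sketch of processing each chunk as it is read instead, using a hypothetical in-memory sample in place of the real test.psv:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the real 'test.psv' input file.
sample = "HHID|VAL_CD64|VAL_CD32\n203|a64|a32\n203|b64|b32\n720|c64|c32\n"

# Iterating the reader directly keeps only one chunk in memory at a time,
# instead of appending every chunk to a list before the loop starts.
reader = pd.read_csv(io.StringIO(sample), sep='|', chunksize=2,
                     dtype={'HHID': 'string', 'VAL_CD64': 'string', 'VAL_CD32': 'string'})

total = 0
for file_no, chunk in enumerate(reader, start=1):
    total += len(chunk)
    # ... widen the chunk and write it out here, as in the original loop ...
```

One caveat that also applies to the original script: a household's rows can straddle a chunk boundary, splitting its values across two output files. Sorting the input by HHID beforehand (or cutting chunks at HHID changes) avoids that.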
Comments: I used a chunksize of 400000 in the script, and it took 16 hours to process 400 million rows. When I write this dataframe out to a csv file, I get output like the following:

...640b69b2df0a314dc9e7c2f8f58a5|"1    0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cd... Name: VAL_CD64, dtype: string"|"2    a18adc1bcae1b570a610b13565b82e5647f05fef8a4680... Name: VAL_CD64, dtype: string"

@AbhinavDhiman Did you use df1.to_csv()? Use this: df1.to_csv(output_file + str(file_no) + '.psv', index=True, sep='|')

My code runs on a Unix server that has no C compiler installed, and I don't have permission to install one.
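The "Name: VAL_CD64, dtype: string" fragments inside the csv cells suggest the `aggfunc=lambda x: x` handed back whole Series objects as cell values. Replacing the lambda with 'first' yields plain scalars (valid here because cumcount makes each HHID/cols pair unique); a sketch with hypothetical shortened values:

```python
import pandas as pd

# Hypothetical miniature of the input with shortened hash values.
df = pd.DataFrame({
    'HHID': [203, 203, 403],
    'VAL_CD64': ['a64', 'b64', 'e64'],
    'VAL_CD32': ['a32', 'b32', 'e32'],
})

df1 = (df.assign(cols=df.groupby('HHID').cumcount() + 1)
         .pivot_table(index='HHID', columns='cols',
                      values=['VAL_CD64', 'VAL_CD32'],
                      aggfunc='first'))  # one scalar per cell, not a Series
df1.columns = [f'{name}_{num}' for name, num in df1.columns]
df1 = df1.reset_index()
```

With scalar cells, df1.to_csv(..., sep='|') writes plain values instead of Series reprs.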