Performance: Python DataFrame.to_csv append becomes gradually slower (initial question)

Tags: python, performance, pandas, append, export-to-csv

I am looping over a few thousand pickle files containing Python DataFrames. The DataFrames vary in their number of rows (roughly between 600 and 1300) but all have the same number of columns (636, to be exact). I then transform them (exactly the same transformation for each one) and append them to a CSV file using the DataFrame.to_csv() method.

Excerpt of the to_csv code:

if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
What bothers me is that it starts out very fast, but performance drops off exponentially. I kept a log of the processing times:

start: 2015-03-24 03:26:36.958058

2015-03-24 03:26:36.958058
count = 0
time: 0:00:00

2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697

2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128

2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459

2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692

2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618

2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877

2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646

2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114

2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602

2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577

2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802

2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343

2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335

2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590

2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412

2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339

2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204

2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146

2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948

2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383

2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974

2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205

2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652

2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415

2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070

2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234

2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679

2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979

2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212

2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924

2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472

2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556

2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356

count = 3324
end of cycle: 2015-03-26 03:59:54.161842

end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784
Update #1: I did as you suggested, @Alexander, but it is definitely related to the to_csv() method:

start: 2015-03-26 05:18:25.948410

2015-03-26 05:18:25.948410
count = 0
time: 0:00:00

2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631

2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541

2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016

2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237

2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361

2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384

2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633

2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624

2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006

2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800

2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756

2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996

2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653

2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465

2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006

2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323

2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602

2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465

2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692

2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336

2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537

2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320

2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145

2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342

2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319

2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824

2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567

2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148

2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286

2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294

2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584

2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725

2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513

count = 3324
end of cycle: 2015-03-26 06:03:47.321182

end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772
As requested, here is the source code:

import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:    
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
    ]
exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
    ]

path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")


for picklefile in pickleFiles: 
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex


    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])

            if max != 0:
                dftemp[string] = dftemp[string]/max

    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')
I will re-run it to confirm, but it looks like it definitely has to do with pandas' to_csv() method, since most of the runtime is spent in io and the csv writer. Why does this happen? Any suggestions?

Update #3: OK, I ran a full %prun test, and in fact almost 90% of the time is spent in {method 'close' of '_io.TextIOWrapper' objects}. So I guess that is where the problem is... What do you think?
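
For reference, roughly the same measurement can be taken outside IPython with the standard-library profiler. A minimal sketch (the process_all wrapper is hypothetical and just stands in for the loop shown above):

import cProfile
import pstats

def process_all():
    # hypothetical wrapper around the pickle -> transform -> to_csv loop shown above
    ...

cProfile.run('process_all()', 'tocsv.prof')        # dump raw stats to a file
stats = pstats.Stats('tocsv.prof')
stats.sort_stats('cumulative').print_stats(20)     # show the 20 most expensive calls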

My questions are:
  • What is causing the performance degradation?
  • Does pandas.DataFrame.to_csv() in append mode load the entire file on every write?
  • Is there a way to speed this process up?

  • My guess is that it comes from the very_deep_copy you are doing. Have you checked memory usage over time? It may be that memory is not being released properly.

    If that is the problem, you could do one of the following:

    1) Avoid the copy altogether (better performance)

    2) Force garbage collection from time to time with gc.collect() (a minimal sketch follows below)
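
    A minimal sketch of option 2. The 100-iteration interval is an arbitrary choice for illustration, not something prescribed by the question's code:

    import gc

    for count, picklefile in enumerate(pickleFiles):   # pickleFiles as built in the question
        # ... read, transform and append exactly as in the original loop ...

        if count % 100 == 0:   # every 100 files; the interval is arbitrary
            gc.collect()       # force a full garbage-collection pass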

    EDIT

    The solution that removes the copy is to:

    1) Store the normalization constant for each column before normalizing

    2) Drop the columns you don't need after normalizing

    # Get the normalizing constant for each column.
    max = {}
    
    for string in head:
        if string not in exclude:
            if 'Real ' in string:
               max[string] = df[string.strip('Real ')].max()
            elif 'child' in string:
               max[string] = df[string.strip('child')+'desc'].max()
            else:
               max[string] = df[string].max()
    
    # Actual normalization, each column is divided by
    # its constant if possible. 
    for key,value in max.items():
        if value != 0:
            df[key] /= value
    
    # Drop the excluded columns 
    df.drop(exclude, axis=1, inplace=True)
    

    In that case you should profile your code (to see which function calls are taking the most time), that way you can check empirically that it really is slow in to_csv rather than somewhere else...

    From your code: first of all, there is a lot of copying and a lot of looping here (not enough vectorization)... every time you see a loop, look for a way to remove it. Secondly, since you are using things like zfill, I wonder whether you actually want to_fwf (fixed width format) rather than to_csv?

    Some sanity tests: are some files significantly bigger than the others (which could push you into swapping)? Are you sure the largest files have only 1200 rows?? Have you checked this, e.g. with wc -l? (A quick Python version of these checks is sketched below.)
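
    A quick way to run those sanity checks from Python rather than from the shell (paths follow the question's ./dataset_final/ layout; this is just an illustrative sketch):

    import os
    import pandas as pd

    path = "./dataset_final/"
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue
        df = pd.read_pickle(full)
        # print the on-disk size and the shape so unusually large files stand out
        print(name, os.path.getsize(full), 'bytes,', df.shape[0], 'rows,', df.shape[1], 'cols')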

    I think it is unlikely to be garbage collection.. (as was suggested in the other answer).


    Here are some improvements to the code which should help the runtime (the full reworked loop is shown at the bottom of this page).

    The columns are fixed, so I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for the zfill).

    As a point of style, I would probably choose to wrap each of these parts into functions; that would also mean more things can be gc'd, if that really were the problem.

    Another option that would be faster is to use pytables (HDFStore), if you don't need the resulting output to be csv (but I expect you do). A minimal sketch of that route follows below.
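
    A minimal sketch of the HDFStore route. The file name 'finalNormalized.h5', the key 'norm' and the transform() helper are made up for illustration, and the tables package has to be installed:

    import pandas as pd

    store = pd.HDFStore('finalNormalized.h5')
    for picklefile in pickleFiles:                              # pickleFiles as built in the question
        dftemp = transform(pd.read_pickle(path + picklefile))   # hypothetical transform step
        # append each frame to a single on-disk table instead of re-opening a CSV file
        store.append('norm', dftemp, index=False)
    store.close()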

    The best thing to do by far is to profile your code, e.g. with %prun in IPython. Then you can see that it definitely is to_csv, and exactly where it is slow (which line of your code and which lines of pandas' code).


    Ah ha, I had missed that you are appending all of these to a single csv file. And in your prun it shows that most of the time is spent in close, so let's keep the file open:

    # outside of the for loop (so the file is opened and closed only once)
    f = open(finalnormCSVFile, 'w')
    
    ...
    for picklefile in ...
    
        if picklefile == '0000.p':
            dftemp.to_csv(f)
        else:
            dftemp.to_csv(f, mode='a', header=False)
    ...
    
    f.close()
    

    Every time the file is opened in append mode before being written to, it needs to seek to the end of the file before writing; it may be that this is what is so expensive (I don't see why this should be that bad, but keeping the file open avoids the need to do it).
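
    The same keep-it-open idea written with a context manager, which also guarantees the handle is closed if the loop raises an exception part-way through (a sketch, assuming the read/transform body from the question):

    with open(finalnormCSVFile, 'w') as f:            # open once, before the loop
        for count, picklefile in enumerate(pickleFiles):
            # ... read and transform into dftemp exactly as in the original loop ...
            dftemp.to_csv(f, header=(count == 0))     # write the header only for the first frame
        # the file is closed automatically when the with-block exits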

    Comments:

    • Without actual sample code demonstrating the problem when writing the file, this is hard to debug. Are you sure the time penalty comes from the .to_csv write and not from the DataFrame processing (you haven't provided any code)? Try skipping the .csv write and just printing the times, to see whether the same performance problem remains.

    • @Alexander full source code added; I will test your suggestion in a few minutes.

    • Depending on your memory limits, have you tried concatenating in memory, e.g. every 100 DataFrames, and then saving the batched DataFrame to .csv? For example, append each transformed DataFrame to a list and concatenate the list into one DataFrame before exporting. (A sketch of this batching idea follows after this list.)

    • Yes @Alexander, I did, but it took even longer. The final csv is close to 6 GB in size.

    • What makes you think it is gc? In fact, collect can be slow and you don't want to run it on every iteration; you could try calling it inside the if count % 100 == 0: block. Have you tried removing the very_deep_copy altogether (e.g. as I suggested in my answer)? I think that is the best option performance-wise. If I understand correctly, your normalization consists of dividing each column by the maximum of another column; the copy is made because you want the max values before the other columns are normalized. A simple workaround is to store all of the normalization constants (the max values) first and then modify the DataFrame. I have updated the proposed code snippet in my answer. As for the to_csv() question, I have to admit I'm a...
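
    For completeness, the batching idea from the comments (concatenate a number of frames in memory and write each batch in one go) could look roughly like the sketch below; note that the asker reports it was actually slower for this dataset. The batch size of 100 is an arbitrary choice:

    import pandas as pd

    BATCH_SIZE = 100
    batch = []
    wrote_header = False

    with open(finalnormCSVFile, 'w') as f:            # finalnormCSVFile as in the question
        for picklefile in pickleFiles:
            # ... read and transform into dftemp exactly as in the original loop ...
            batch.append(dftemp)
            if len(batch) == BATCH_SIZE:
                pd.concat(batch).to_csv(f, header=not wrote_header)
                wrote_header = True
                batch = []
        if batch:                                     # flush the final, partial batch
            pd.concat(batch).to_csv(f, header=not wrote_header)
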
    The improved, vectorized loop from the answer above:

    columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
    remaining_cols = set(head) - set(exclude)
    real_cols = [r for r in remaining_cols if 'Real ' in r]
    real_cols_suffix = [r.strip('Real ') for r in real_cols]
    remaining_cols = remaining_cols - set(real_cols)
    child_cols = [r for r in remaining_cols if 'child' in r]
    child_cols_desc = [r.strip('child') + 'desc' for r in child_cols]
    remaining_cols = remaining_cols - set(child_cols)

    for count, picklefile in enumerate(pickleFiles):
        if count % 100 == 0:
            t2 = datetime.now()
            print(str(t2))
            print('count = ' + str(count))
            print('time: ' + str(t2 - t1) + '\n')
            t1 = t2

        #DataFrame Manipulation:
        df = pd.read_pickle(path + picklefile)

        df['ConcatIndex'] = 100000*df.FileID + df.ID
        # use apply here rather than iterating
        df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
        df.index = df.ConcatIndex

        #DataFrame Normalization:
        dftemp = df.very_deep_copy()  # don't *think* you need this

        # drop all excludes
        dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

        # normalize real cols
        m = dftemp[real_cols_suffix].max()
        m.index = real_cols
        dftemp[real_cols] = dftemp[real_cols] / m

        # normalize child cols
        m = dftemp[child_cols_desc].max()
        m.index = child_cols
        dftemp[child_cols] = dftemp[child_cols] / m

        # normalize remaining
        remaining = list(remaining_cols)
        dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

        # if this case is important then discard the rows of m where .max() is 0
        #if max != 0:
        #    dftemp[string] = dftemp[string]/max

        # this is dropped earlier; if you need it, subtract ['ConcatIndex'] from columns_to_drop
        # dftemp.drop('ConcatIndex', axis=1, inplace=True)

        #Saving DataFrame in CSV:
        if picklefile == '0000.p':
            dftemp.to_csv(finalnormCSVFile)
        else:
            dftemp.to_csv(finalnormCSVFile, mode='a', header=False)