Python/pandas: writing a file line by line and memory usage


python pandas


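I have loaded a large DataFrame (~9 GB) into memory with pandas. I am trying to write out a text file in a specific format (Vowpal Wabbit), but the memory usage and performance confuse me. The file is large (48 million rows), yet the initial load into pandas was not bad. Writing the file out takes more than 6 hours and cripples my laptop, consuming almost all of my RAM (32 GB). Naively, I assumed this operation works on one row at a time, so RAM usage would be very small. Is there a more efficient way to process this data?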

with open("C:\\Users\\Desktop\\DATA\\train_mobile2.vw", "wb") as outfile:
    for index, row in train.iterrows():
        if row['click'] ==0:
            vwline=""
            vwline+="-1 "
        else:
            vwline=""
            vwline+="1 "
        vwline+="|a C1_"+ str(row['C1']) +\
        " |b banpos_"+ str(row['banner_pos']) +\
        " |c siteid_"+ str(row['site_id']) +\
        " sitedom_"+ str(row['site_domain']) +\
        " sitecat_"+ str(row['site_category']) +\
        " |d appid_"+ str(row['app_id']) +\
        " app_domain_"+ str(row['app_domain']) +\
        " app_cat_"+ str(row['app_category']) +\
        " |e d_id_"+ str(row['device_id']) +\
        " d_ip_"+ str(row['device_ip']) +\
        " d_os_"+ str(row['device_os']) +\
        " d_make_"+ str(row['device_make']) +\
        " d_mod_"+ str(row['device_model']) +\
        " d_type_"+ str(row['device_type']) +\
        " d_conn_"+ str(row['device_conn_type']) +\
        " d_geo_"+ str(row['device_geo_country']) +\
        " |f num_a:"+ str(row['C17']) +\
        " numb:"+ str(row['C18']) +\
        " numc:"+ str(row['C19']) +\
        " numd:"+ str(row['C20']) +\
        " nume:"+ str(row['C22']) +\
        " numf:"+ str(row['C24']) +\
        " |g c21_"+ str(row['C21']) +\
        " C23_"+ str(row['C23']) +\
        " |h hh_"+ str(row['hh']) +\
        " |i doe_"+ str(row['doe']) 
        outfile.write(vwline + "\n")

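Following a user's suggestion

I wrote the following code, but when it runs, the last line raises an error: unsupported operand type(s) for +: 'numpy.ndarray' and 'str'

lines.to_csv("C:\\Users\\Desktop\\DATA\\KAGGLE\\mobile\\train_mobile.vw", mode='a', header=False, index=False)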

I am not sure about the memory usage, but this will certainly be faster:

lines = (pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index) +
         "|a C1_" + train['C1'].astype('str') +
         " |b banpos_" + train['banner_pos'].astype('str') +
         ...  # remaining columns follow the same pattern
        )

Then save the lines:

lines.to_csv(outfile, index=False, header=False)

If memory becomes a problem, you can also do this in batches (say, a few million records at a time).
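The batching idea could be sketched roughly as follows. This is only an illustration: the helper names `to_vw_lines` and `write_vw` and the default `batch_size` are assumptions, not part of the original answer, and only the first two namespaces are spelled out.

```python
import numpy as np
import pandas as pd


def to_vw_lines(chunk: pd.DataFrame) -> pd.Series:
    """Build one Vowpal Wabbit line per row using vectorized string operations."""
    # Wrap np.where in pd.Series so that "+" concatenates strings element-wise.
    label = pd.Series(np.where(chunk["click"] == 0, "-1 ", "1 "), index=chunk.index)
    return (
        label
        + "|a C1_" + chunk["C1"].astype(str)
        + " |b banpos_" + chunk["banner_pos"].astype(str)
        # the remaining namespaces would follow the same pattern
    )


def write_vw(train: pd.DataFrame, path: str, batch_size: int = 1_000_000) -> None:
    """Write the frame batch by batch so only one batch of strings exists at a time."""
    with open(path, "w", newline="\n") as outfile:
        for start in range(0, len(train), batch_size):
            chunk = train.iloc[start:start + batch_size]
            outfile.write("\n".join(to_vw_lines(chunk)) + "\n")
```

Calling `write_vw(train, "train_mobile2.vw")` would then process one million rows per pass. A smaller `batch_size` lowers peak memory at the cost of more passes through the loop.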

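Are you using iterrows(), which is known to be inefficient? In general, vectorized operations are faster because they are highly optimized. iterrows() is slow, but I would be surprised if it were the cause of the memory bloat.

I rewrote the code, but the last line raises an error. I will add it to the question above.

You need to wrap np.where() in pd.Series(), as in pd.Series(np.where(…))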
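To make the comparison in these comments concrete, here is a small sketch on made-up data (the toy frame and the names `slow` and `fast` are assumptions for illustration). It builds the same lines once with iterrows() and once with the vectorized expression that wraps np.where() in pd.Series(), then checks that both agree.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; only two of the columns are used here.
train = pd.DataFrame({"click": [0, 1, 1], "C1": [1005, 1002, 1010]})

# Row-by-row version, as in the question: iterrows() yields a new Series per row.
slow = [
    "{} |a C1_{}".format("-1" if row["click"] == 0 else "1", row["C1"])
    for _, row in train.iterrows()
]

# Vectorized version: np.where returns a NumPy array, so it is wrapped in
# pd.Series before concatenating with "+".
labels = pd.Series(np.where(train["click"] == 0, "-1 ", "1 "), index=train.index)
fast = (labels + "|a C1_" + train["C1"].astype(str)).tolist()

print(slow == fast)
```

The vectorized form builds each column's strings in bulk instead of creating one Series per row, which is where the speed-up is expected to come from.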
