Python: memory error when dropping rows from a DataFrame

python performance pandas

I have a large data frame: 26 million rows of data for 9,000 people. The index is not ordered.

I need to run some calculations on each person's data separately and save the results to a new data frame, one row per person.

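I wrote a loop over the unique person ids: it extracts the small data frame for that person, runs the calculations on it, and saves the result to a predefined data frame. The calculations are mainly sums and divisions on columns under specific conditions.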
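The per-person loop described above can often be replaced by a single grouped aggregation. The following is a minimal sketch, not the actual calculation from the question: the function name `summarize_accounts` and the rule it applies (sum of `Col1I` over rows where `category == 1`, divided by the sum of `Col2I`) are assumptions chosen only to illustrate "sum and divide under a condition". The column names are taken from the example frame in the question.

```python
import pandas as pd


def summarize_accounts(df):
    """Return one row per account using a single groupby pass.

    The rule below is a made-up stand-in for the real calculation.
    """
    work = pd.DataFrame({
        "account": df["account"],
        # Keep Col1I only where the (hypothetical) condition holds, else 0.
        "col1_cat1": df["Col1I"].where(df["category"] == 1, 0),
        "col2_total": df["Col2I"],
    })
    # One pass over the data: sum both helper columns per account.
    out = work.groupby("account").sum()
    out["ratio"] = out["col1_cat1"] / out["col2_total"]
    return out


# Tiny usage example with two accounts.
example = pd.DataFrame({
    "account":  [1, 1, 2, 2, 2],
    "category": [1, 2, 1, 1, 3],
    "Col1I":    [10, 20, 30, 40, 50],
    "Col2I":    [5, 15, 10, 20, 40],
})
result = summarize_accounts(example)
print(result)
```

Whether this applies depends on whether the real conditions can be expressed as column masks; if they can, no per-person sub-frame is ever materialised.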

This took about an hour on an Amazon Linux server, which is not practical.

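To make it more efficient, I tried dropping the current person's rows from the data frame, expecting the shrinking frame to speed up later iterations. After 2-4 steps I got a memory error.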
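Shrinking the frame inside the loop is not required in order to visit each person's rows exactly once. Here is a sketch of one alternative, assuming the per-person step can be written as a function of that person's sub-frame (`process_person` and `one_row_per_person` are hypothetical names, and the body of `process_person` is a placeholder). `groupby` partitions the rows once and yields each sub-frame in turn, so the original frame is never modified.

```python
import pandas as pd


def process_person(sub):
    # Hypothetical placeholder for the real per-person calculation.
    return {"rows": len(sub), "col1_sum": int(sub["Col1I"].sum())}


def one_row_per_person(df, key="account"):
    """Build a one-row-per-person frame without dropping anything from df."""
    keys, rows = [], []
    # sort=False yields groups in order of first appearance.
    for person, sub in df.groupby(key, sort=False):
        keys.append(person)
        rows.append(process_person(sub))
    return pd.DataFrame(rows, index=pd.Index(keys, name=key))


# Usage example with an unordered index, as in the question.
example = pd.DataFrame(
    {"account": [7, 3, 7, 3, 7], "Col1I": [1, 2, 3, 4, 5]},
    index=[4, 0, 3, 1, 2],
)
summary = one_row_per_person(example)
print(summary)
```

Compared with filtering by `df.account == w` on every iteration, this avoids rescanning the whole frame for each person.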

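I managed to reproduce the problem on my Windows laptop. It appears only when the data frame is large enough: on my laptop, from 4,000,000 rows. Resetting the index solved the problem at that size, but the memory error returned at the larger size of 40,000,000 rows.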

I am new to large data sets and to pandas, so any ideas would be appreciated.

import random
import time

import numpy as np
import pandas as pd

# Fix the seeds so the example is reproducible.
random.seed(50)
np.random.seed(50)

size = 4000000  # number of rows; the error appears from about this size

# Structured dtype: 8 int32 columns and 3 float32 columns.
dtype = [('view_day', 'int32'), ('account', 'int32'), ('category', 'int32'),
         ('Col1I', 'int32'),
         ('Col2I', 'int32'), ('Col3I', 'int32'),
         ('Col4F', 'float32'), ('Col5F', 'float32'), ('Col6F', 'float32'),
         ('isFull', 'int32'), ('islong', 'int32')]
values = np.ones(size, dtype=dtype)

# Shuffle the index so that it is not ordered, as in the real data.
index = np.arange(size)
np.random.shuffle(index)

df = pd.DataFrame(values, index=index)
df['view_day'] = np.random.randint(7605, 7605 + 180, df.shape[0])
df['account'] = np.random.randint(1548051, 1548051 + 100, df.shape[0])
df['category'] = np.random.randint(1, 5, df.shape[0])
df['Col1I'] = np.random.randint(600, 1200, df.shape[0])
df['Col2I'] = np.random.randint(1, 600, df.shape[0])

accounts = df.account.unique()

for w in accounts:
    # Rows of the current account.
    dfs = df[df.account == w]  # .copy() - both versions cause the memory error

    # print() with one argument behaves the same on Python 2 and Python 3.
    print(dfs.shape)
    print(df.shape)
    df.drop(dfs.index, inplace=True)  # MemoryError is raised here


---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-32-7c390ede93df> in <module>()
      4     print dfs.shape
      5     print df.shape
----> 6     df.drop(dfs.index, inplace=True)

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\generic.pyc in drop(self, labels, axis, level, inplace, errors)
   1876             else:
   1877                 new_axis = axis.drop(labels, errors=errors)
-> 1878             dropped = self.reindex(**{axis_name: new_axis})
   1879             try:
   1880                 dropped.axes[axis_].set_names(axis.names, inplace=True)

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\frame.pyc in reindex(self, index, columns, **kwargs)
   2739     def reindex(self, index=None, columns=None, **kwargs):
   2740         return super(DataFrame, self).reindex(index=index, columns=columns,
-> 2741                                               **kwargs)
   2742 
   2743     @Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs)

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs)
   2227         # perform the reindex on the axes
   2228         return self._reindex_axes(axes, level, limit, tolerance, method,
-> 2229                                   fill_value, copy).__finalize__(self)
   2230 
   2231     def _reindex_axes(self, axes, level, limit, tolerance, method, fill_value,

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   2685         if index is not None:
   2686             frame = frame._reindex_index(index, method, copy, level,
-> 2687                                          fill_value, limit, tolerance)
   2688 
   2689         return frame

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit, tolerance)
   2696         return self._reindex_with_indexers({0: [new_index, indexer]},
   2697                                            copy=copy, fill_value=fill_value,
-> 2698                                            allow_dups=False)
   2699 
   2700     def _reindex_columns(self, new_columns, copy, level, fill_value=NA,

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\generic.pyc in _reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   2339                                                 fill_value=fill_value,
   2340                                                 allow_dups=allow_dups,
-> 2341                                                 copy=copy)
   2342 
   2343         if copy and new_data is self._data:

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\internals.pyc in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy)
   3595             new_blocks = [blk.take_nd(indexer, axis=axis, fill_tuple=(
   3596                 fill_value if fill_value is not None else blk.fill_value,))
-> 3597                 for blk in self.blocks]
   3598 
   3599         new_axes = list(self.axes)

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\internals.pyc in take_nd(self, indexer, axis, new_mgr_locs, fill_tuple)
    994             fill_value = fill_tuple[0]
    995             new_values = algos.take_nd(values, indexer, axis=axis,
--> 996                                        allow_fill=True, fill_value=fill_value)
    997 
    998         if new_mgr_locs is None:

C:\Users\naomi\Anaconda2\lib\site-packages\pandas\core\algorithms.pyc in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
    928             out = np.empty(out_shape, dtype=dtype, order='F')
    929         else:
--> 930             out = np.empty(out_shape, dtype=dtype)
    931 
    932     func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,

MemoryError:
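The traceback above ends in `np.empty(out_shape, dtype=dtype)` inside `take_nd`, reached through `drop` calling `self.reindex`. Read that way, `drop(..., inplace=True)` in this pandas version appears to allocate new arrays for the remaining rows before the old data is replaced, so both copies exist for a moment. Below is a rough sketch for sizing that temporary copy. It assumes the dtypes declared in the question are preserved (the `randint` assignments may widen some columns to 64-bit on some platforms), and `frame_bytes` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Column dtypes as declared in the question.
DTYPES = [('view_day', 'int32'), ('account', 'int32'), ('category', 'int32'),
          ('Col1I', 'int32'), ('Col2I', 'int32'), ('Col3I', 'int32'),
          ('Col4F', 'float32'), ('Col5F', 'float32'), ('Col6F', 'float32'),
          ('isFull', 'int32'), ('islong', 'int32')]


def frame_bytes(n_rows, sample_rows=1000):
    """Estimate the bytes of column data for n_rows, excluding the index."""
    # Measure a small frame with the same dtypes, then scale linearly.
    small = pd.DataFrame({name: np.ones(sample_rows, dtype=dt)
                          for name, dt in DTYPES})
    per_row = small.memory_usage(index=False).sum() / len(small)
    return int(per_row * n_rows)


for n in (4000000, 40000000):
    print(n, frame_bytes(n))
```

Under these assumptions the column data is 44 bytes per row, so about 176 MB for 4,000,000 rows and about 1.76 GB for 40,000,000 rows, before counting the index or any lookup structures pandas builds for an unordered index.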
