
Python: DataFrame reindexed object unnecessarily kept in memory

Tags: python, pandas, ipython, ipython-notebook

In continuation to an earlier question, I implemented two functions that do the same thing, one with reindexing and one without. The functions differ in their 3rd line:

def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()

        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist

def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)  # the differing 3rd line: pre-select the relevant rows once
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()

        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
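
For context, here is a minimal, self-contained setup that makes the two functions above runnable. The sizes, the members_by_centeriod mapping and the 'score' column are assumptions pieced together from the snippets in this question (the real df has about 4 million rows):

import numpy as np
import pandas as pd

# Toy stand-in for the real data: (member, member) pairs in a MultiIndex
# with a 'score' column, matching the df.ix[member].ix[clust_members].score
# access pattern used above.  Sizes are kept tiny for illustration.
n = 50
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
df = pd.DataFrame({"score": np.random.rand(n * n) * 100}, index=index)

# Hypothetical centroid -> members assignment.
members_by_centeriod = {0: [1, 5, 7, 12], 1: [3, 4, 20, 33]}

new_centroid, dist = update1(0)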
The DataFrame df is large, about 4 million rows, and takes up roughly 300MB of memory.

The update1 function, which uses reindexing, is much faster. However, something unexpected happens: when running the reindexing version, after a few iterations the memory quickly grows from ~300MB to 1.5GB and then I get a memory error.

The update function does not suffer from this behavior. Two things I do not get:

Obviously, reindexing creates a copy. But shouldn't that copy disappear every time the update1 function finishes? The new_df variable should die together with the function that created it... right?

Even if the garbage collector does not kill new_df right away, once memory runs out it should kill it instead of raising an out-of-memory error, right?

I tried manually adding del new_df at the end of the update1 function, but it did not help. So does this indicate that the bug is actually inside the reindexing process?
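
One quick way to test the assumption that the copy dies with the function is to count the live DataFrame objects before and after a call to update1. A minimal sketch (the counting helper and some_centroid are mine, not part of the original code):

import gc
import pandas as pd

def count_live_dataframes():
    # Collect first so only genuinely reachable frames are counted.
    gc.collect()
    return sum(1 for obj in gc.get_objects() if isinstance(obj, pd.DataFrame))

before = count_live_dataframes()
update1(some_centroid)          # some_centroid: any key of members_by_centeriod
after = count_live_dataframes()
print after - before            # 0 means the reindexed copy itself was released

Even if this prints 0, memory can still grow: as it turns out below, what accumulates are caches held by df.index rather than the copies themselves.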

Edit:

I located the problem, but I do not understand the reason for this behavior: it is the python garbage collector refusing to clean up the reindexed DataFrame. This works:

for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
This also works:

def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()
This, however, keeps the reindexed objects in memory:

z = []
for i in range(2000):
    z.append(reindex())
I think my usage is naively correct. How does the new_df variable stay connected to the score value, and why?

Here is my debug code. When you do index lookups, the Index object creates the tuples and the engine map; I think the memory is being used by these two cached objects. If I add the lines marked with ****, the memory increase is very small, about 6MB on my PC:

import pandas as pd
print pd.__version__
import numpy as np
import psutil
import os
import gc

def get_memory():
    # Resident set size of the current process, in bytes
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    # Snapshot of the ids of all objects currently tracked by the GC
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

# Build a DataFrame with a 2000x2000 MultiIndex, similar in shape to the question
n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a","b","c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
df.index._tuples = None    # **** drop the cached tuple representation of the MultiIndex
df.index._cleanup()        # **** drop the cached index engine
del df2
gc.collect()               # ****
m3 = get_memory()

print (m2-m1)/1e6, (m3-m2)/1e6

# Count the objects (by type) created since the snapshot above
from collections import Counter

counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
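
A sketch of how the same workaround could be folded back into update1. It assumes, as above, that the MultiIndex caches on df.index are what keeps growing; _tuples and _cleanup() are internal pandas (0.12-era) attributes, so this is a hack rather than a supported API:

import gc

def update1_with_cleanup(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    # Drop the caches the MultiIndex built up while reindexing/indexing,
    # then give reference counting and the GC a chance to release the copy.
    df.index._tuples = None
    df.index._cleanup()
    del new_df
    gc.collect()
    return centroid, best_mean_dist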

Comments:

- Please post your python and pandas versions; you are running 64-bit, right? And you have at least 4GB of RAM?
- '0.12.0', running 32-bit with 8GB of RAM. But it does not seem to be the reindexing that causes the problem; see my comment below.
- 32-bit gives you only 2GB of usable address space out of 4GB total, so the extra RAM does nothing; in practice, since python usually needs contiguous space, it is hard to allocate much more than about 1GB at a time. You will have much more success on 64-bit.
- I found the problem and edited the bottom of the question; your thoughts would be highly appreciated.
- Re your update: you are creating a new object every time, and since you hold a reference to each one, your memory grows without bound. That code is not a memory leak.
- Also, my reindexing in a loop does not create a memory leak. Nor is this an ipython notebook problem, since running the code in WING leaks memory as well. So it has to be related either to my usage or to the python garbage collector. I am looking into it; your suggestions are much appreciated.
- Thanks, I will try it and let you know. But do you have any insight into why the garbage collector does not clean up the DataFrame?