Python: how to reduce memory usage?

python pandas memory dataframe categorical-data


I just started a Jupyter terminal and loaded an Excel file (~12 MB) into a DataFrame.

Before loading the file:

>> import resource
>> print('Memory usage: %s (Mb)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024))

Memory usage: 40 (Mb)

After loading the file into a DataFrame:

>> import pandas as pd
>> df = pd.read_excel('/var/www/temp_test_files/stackoverflow_survey_2016.xlsx')
>> print('Memory usage: %s (Mb)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024))

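Memory usage: 193 (Mb)

Why does loading a 12 MB file into pandas take about 150 MB more memory, over 12 times the file's actual size?

Below is a detailed breakdown of the column data types. My guess is that the object dtype allocates more memory than the columns actually use.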

>> df.info(memory_usage=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56030 entries, 0 to 56029
Data columns (total 57 columns):
collector                   56030 non-null object
country                     55528 non-null object
un_subregion                55313 non-null object
so_region                   55390 non-null object
age_range                   55727 non-null object
age_midpoint                55336 non-null float64
gender                      55586 non-null object
self_identification         54202 non-null object
occupation                  49519 non-null object
occupation_group            46934 non-null object
experience_range            49520 non-null object
experience_midpoint         49520 non-null float64
salary_range                46121 non-null object
salary_midpoint             41742 non-null float64
programming_ability         46982 non-null float64
employment_status           49576 non-null object
industry                    40110 non-null object
company_size_range          39932 non-null object
team_size_range             39962 non-null object
women_on_team               39808 non-null object
remote                      40118 non-null object
job_satisfaction            40110 non-null object
job_discovery               40027 non-null object
commit_frequency            46598 non-null object
hobby                       46673 non-null object
dogs_vs_cats                45239 non-null object
desktop_os                  46451 non-null object
unit_testing                46657 non-null object
rep_range                   46143 non-null object
visit_frequency             46154 non-null object
why_learn_new_tech          46145 non-null object
education                   44955 non-null object
open_to_new_job             44380 non-null object
new_job_value               43658 non-null object
job_search_annoyance        42851 non-null object
interview_likelihood        42263 non-null object
star_wars_vs_star_trek      34398 non-null object
agree_tech                  42662 non-null object
agree_notice                42755 non-null object
agree_problemsolving        42659 non-null object
agree_diversity             42505 non-null object
agree_adblocker             42627 non-null object
agree_alcohol               42692 non-null object
agree_loveboss              42096 non-null object
agree_nightcode             42613 non-null object
agree_legacy                42382 non-null object
agree_mars                  42685 non-null object
important_variety           42628 non-null object
important_control           42572 non-null object
important_sameend           42531 non-null object
important_newtech           42604 non-null object
important_buildnew          42538 non-null object
important_buildexisting     42580 non-null object
important_promotion         42483 non-null object
important_companymission    42529 non-null object
important_wfh               42582 non-null object
important_ownoffice         42538 non-null object
dtypes: float64(4), object(53)
memory usage: 24.8+ MB
None
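The + in "24.8+ MB" marks a lower bound: for object columns, pandas counts only one pointer per row unless a deep inspection is requested. A minimal sketch of comparing the shallow and deep figures follows. It uses a small made-up frame; the values are illustrative and are not the survey data.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: one string column stored as
# object dtype (as in the question) and one float64 column.
df = pd.DataFrame({
    "country": pd.Series(["United States", "India", "Germany", "India"] * 1000,
                         dtype=object),
    "age_midpoint": [27.0, 22.0, 32.0, 22.0] * 1000,
})

# Shallow: each object cell is counted as a pointer only.
shallow = df.memory_usage(deep=False)
# Deep: the Python string objects behind those pointers are measured too.
deep = df.memory_usage(deep=True)

print(shallow)
print(deep)

# Same report as df.info(memory_usage=True), but with the deep total.
df.info(memory_usage="deep")
```

On the real frame, the deep figure for the 53 object columns would be expected to account for much of the gap between 24.8 MB and the observed process growth.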

Are there any "best practice" approaches to reduce the actual memory footprint of the DataFrame?

Tweaking dtypes?
Categoricals?

Why does loading a 12 MB file into pandas take about 150 MB more memory, over 12 times the file's actual size?

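The 12 MB Excel file is in a compressed format. The actual size of the raw data in the file is probably 5 or 10 times that. You can verify this by simply renaming the file to .zip and extracting its contents.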
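The same check can be done without renaming anything, since an .xlsx file is a zip archive that Python's zipfile module can read directly. The sketch below is illustrative: zip_sizes is a hypothetical helper name, and the demo runs on a throwaway archive rather than on the survey file.

```python
import os
import tempfile
import zipfile


def zip_sizes(path):
    """Return (compressed_bytes, uncompressed_bytes) summed over all archive members."""
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


# Demo on a throwaway archive of repetitive XML, standing in for a real .xlsx.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with zipfile.ZipFile(demo_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("xl/worksheets/sheet1.xml",
                "<row><c><v>United States</v></c></row>" * 10000)

compressed, uncompressed = zip_sizes(demo_path)
print("compressed: %d bytes, uncompressed: %d bytes" % (compressed, uncompressed))
```

Calling zip_sizes on the actual spreadsheet path would show how far the 12 MB on disk understates the raw XML inside it.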

Are there any "best practice" approaches to reduce the actual memory footprint of a pandas DataFrame? Tweaking dtypes? Categoricals?

Yes: using dtypes, the category data type, and downcasting are the best approaches.

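Since most of your columns are of string type, the best option for reducing memory is to use dtypes to give low-cardinality columns an explicit category type. For example: country, region, gender, status.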
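A minimal sketch of the category conversion, assuming a small made-up frame whose column names echo the survey but whose values are invented for illustration:

```python
import pandas as pd

# Hypothetical low-cardinality string columns stored as object dtype.
df = pd.DataFrame({
    "country": pd.Series(["United States", "India", "Germany", "India"] * 1000,
                         dtype=object),
    "gender": pd.Series(["Male", "Female", "Male", "Other"] * 1000,
                        dtype=object),
})

before = df.memory_usage(deep=True).sum()

# Each distinct string is stored once; rows hold small integer codes instead.
df = df.astype({"country": "category", "gender": "category"})

after = df.memory_usage(deep=True).sum()
print("before: %d bytes, after: %d bytes" % (before, after))
print(df.dtypes)
```

The saving depends on cardinality: a column whose values are nearly all distinct gains little or nothing from this. pandas readers also take a dtype argument, which may be what the answer means by using dtypes at load time; that variant is not shown here.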

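All 4 of your float64 columns could also be reduced to an unsigned 32-bit data type (or smaller, e.g. for age and salary). Note that these columns contain missing values, so a plain unsigned integer type would need the NaNs handled first; float32 or a nullable integer dtype avoids that.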
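A sketch of two ways the float64 columns might be shrunk, assuming a made-up column with missing values. The numbers are illustrative, and the integer option only applies when every value is a whole number.

```python
import numpy as np
import pandas as pd

# Hypothetical float64 column with gaps, like the midpoint columns in the question.
df = pd.DataFrame({"salary_midpoint": [45000.0, np.nan, 65000.0, 5000.0] * 1000})

# Option 1: stay in floating point. NaN is preserved and each value takes 4 bytes.
as_float32 = pd.to_numeric(df["salary_midpoint"], downcast="float")

# Option 2: nullable unsigned integer. NaN becomes <NA>; this raises if any
# value has a fractional part, so it suits whole-number columns only.
as_uint32 = df["salary_midpoint"].astype("UInt32")

for name, col in [("float64", df["salary_midpoint"]),
                  ("float32", as_float32),
                  ("UInt32", as_uint32)]:
    print(name, col.memory_usage(index=False), "bytes")
```

With only four float columns out of 57, this is a small saving next to the object columns, so the category conversion is likely the larger win.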