Python Dask is inefficient when concatenating large DataFrames and causes a MemoryError


First, I tried a typical concatenation of pandas DataFrames:

df=pd.concat([df,df_filtered2],axis=1,sort=False)
But it gave an error:

/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
Traceback (most recent call last):
  File "process_data_interpolation.py", line 435, in <module>
    df=pd.concat([df,df_filtered2],axis=1,sort=False)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 255, in concat
    sort=sort,
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 335, in __init__
    obj._consolidate(inplace=True)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5270, in _consolidate
    self._consolidate_inplace()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5252, in _consolidate_inplace
    self._protect_consolidate(f)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5241, in _protect_consolidate
    result = f()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5250, in f
    self._data = self._data.consolidate()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 932, in consolidate
    bm._consolidate_inplace()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
    new_values = new_values[argsort]
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64
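I then tried the same concatenation with Dask, using dd.concat([df,df_filtered2],axis=1), but it also gave me a MemoryError: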

/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
Traceback (most recent call last):
  File "process_data_interpolation.py", line 443, in <module>
    df = dd.concat([df,df_filtered2],axis=1)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/multi.py", line 1045, in concat
    dfs = _maybe_from_pandas(dfs)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in _maybe_from_pandas
    for df in dfs
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in <listcomp>
    for df in dfs
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in from_pandas
    for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in <dictcomp>
    for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
    return self._get_slice_axis(key, axis=axis)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1308, in _get_slice_axis
    return self._slice(indexer, axis=axis, kind="iloc")
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 166, in _slice
    return self.obj._slice(obj, axis=axis, kind=kind)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 3371, in _slice
    result = self._constructor(self._data.get_slice(slobj, axis=axis))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 755, in get_slice
    bm._consolidate_inplace()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
    new_values = new_values[argsort]
MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64

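What else can I try? I am running the Python script on a Linux node with 128 GB of memory. In my case, after dropping unnecessary columns and converting some columns to integers, one of the DataFrames is 44.48 GB in size.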
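For a sense of scale, the size of the array named in the MemoryError can be worked out from its shape and dtype. The short sketch below only does that arithmetic; the variable names are illustrative. Because `new_values[argsort]` is NumPy fancy indexing, it returns a copy, so this allocation comes on top of the DataFrames that are already in memory.

```python
# Shape and dtype taken from the MemoryError message in the traceback.
rows, cols = 41, 156_082_680
bytes_per_float64 = 8

# Size of the single consolidated float64 block pandas tried to allocate.
nbytes = rows * cols * bytes_per_float64

print(f"{nbytes / 1e9:.1f} GB ({nbytes / 2**30:.1f} GiB)")
```

With one DataFrame already at 44.48 GB, an extra copy of roughly this size plausibly pushes the process past 128 GB, although the exact peak depends on what else is held in memory.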

This question is answered in the Dask best practices documentation:


Maybe a swap file could help you.
Is it possible to merge instead of concat?
Also, the error seems to occur after using .compute(). What if you try to_parquet instead of compute?