Python MemoryError when merging two dataframes with pandas and Dask: how can I get this to work?


I have two pandas dataframes that I would like to merge, but I keep running into a MemoryError. What workaround can I use?

Here is the setup:

import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
print(df1.shape) # output: (4757076, 4)
print(df2.shape) # output: (428764, 45)


df1.head()

  column1  begin    end  category
0  class1  10001  10468     third
1  class1  10469  11447     third
2  class1  11505  11675    fourth
3  class2  15265  15355   seventh
4  class2  15798  15849    second


df2.head()
  column1  begin  ...
0  class1  10524  ...
1  class1  10541  ...
2  class1  10549  ...
3  class1  10565  ...
4  class1  10596  ...
I just want to merge these two dataframes on "column1", but this always results in a MemoryError.
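Merging on a non-unique key is a many-to-many join: every matching row on the left pairs with every matching row on the right, so the output can be vastly larger than either input. A toy sketch of the blow-up (the values are invented, not from the question's data):

```python
import pandas as pd

# "class1" appears 3 times on the left and 4 times on the right,
# so the join on that key alone yields a 3 x 4 block of rows.
left = pd.DataFrame({"column1": ["class1"] * 3, "begin": [1, 2, 3]})
right = pd.DataFrame({"column1": ["class1"] * 4, "x": [10, 20, 30, 40]})

merged = pd.merge(left, right, on="column1", how="outer")
print(len(merged))  # 12 rows from only 3 + 4 input rows
```

With ~4.7 million rows on one side and ~430 thousand on the other sharing a handful of key values, the same multiplication can easily exhaust memory.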


Let's first try it in pandas, on a system with roughly 2 TB of RAM and hundreds of threads:

import pandas as pd
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
merged = pd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeated"))

Here's the error I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/src/join.pyx", line 160, in pandas.algos.full_outer_join (pandas/algos.c:61256)
MemoryError
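Before reaching for chunking or Dask, it can be worth shrinking the frames themselves: a repeated string key stored as `category` dtype and integer columns downcast to the smallest width that fits often cut pandas' working set substantially. A minimal in-memory sketch with toy stand-ins for the question's frames (the real saving scales with row count):

```python
import pandas as pd

# Toy stand-in for df1: a heavily repeated string key plus an int column.
df1 = pd.DataFrame({"column1": ["class1", "class2"] * 1000,
                    "begin": [10001, 15265] * 1000})

before = df1.memory_usage(deep=True).sum()

# Repeated strings compress well as categoricals (stored as small int codes);
# pd.to_numeric(..., downcast="integer") picks the narrowest int dtype that fits.
df1["column1"] = df1["column1"].astype("category")
df1["begin"] = pd.to_numeric(df1["begin"], downcast="integer")

after = df1.memory_usage(deep=True).sum()
print(after < before)  # True
```

The same conversions can be applied to both frames before calling `pd.merge`; for CSV input, passing `dtype={"column1": "category"}` to `read_csv` avoids ever holding the strings as objects.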

That didn't work. Let's try with dask:


import pandas as pd
import dask.dataframe as dd
from numpy import nan


ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)

Here's the error I get:

Traceback (most recent call last):
  File "repeat_finder.py", line 15, in <module>
    merged = dd.merge(ddf1, ddf2,on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
  File "/path/python3.5/site-packages/dask/base.py", line 78, in compute
    return compute(self, **kwargs)[0]
  File "/path/python3.5/site-packages/dask/base.py", line 178, in compute
    results = get(dsk, keys, **kwargs)
  File "/path/python3.5/site-packages/dask/threaded.py", line 69, in get
    **kwargs)
  File "/path/python3.5/site-packages/dask/async.py", line 502, in get_async
    raise(remote_exception(res, tb))
dask.async.MemoryError: 

Traceback
---------
  File "/path/python3.5/site-packages/dask/async.py", line 268, in execute_task
    result = _execute_task(task, data)
  File "/path/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/path/python3.5/site-packages/dask/dataframe/methods.py", line 221, in merge
    suffixes=suffixes, indicator=indicator)
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 59, in merge
    return op.get_result()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 503, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 667, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 647, in _get_join_indexers
    how=self.how)
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 876, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/src/join.pyx", line 226, in pandas._join.full_outer_join (pandas/src/join.c:11286)
  File "pandas/src/join.pyx", line 231, in pandas._join._get_result_indexer (pandas/src/join.c:11474)
  File "path/python3.5/site-packages/pandas/core/algorithms.py", line 1072, in take_nd
    out = np.empty(out_shape, dtype=dtype, order='F')
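Another way to bound memory is to stream the large file through `read_csv(chunksize=...)` and merge each chunk against the smaller frame, writing each piece to disk instead of holding the full result. This is straightforward for a left or inner join; a full outer join would need an extra pass to pick up right-side keys that matched no chunk. A small self-contained sketch (toy CSV standing in for first1.csv):

```python
import io
import pandas as pd

# Toy stand-ins: a "large" CSV streamed in chunks, and a small lookup frame.
big_csv = io.StringIO("column1,begin\nclass1,1\nclass1,2\nclass2,3\nclass2,4\n")
df2 = pd.DataFrame({"column1": ["class1", "class2"], "x": [10, 20]})

pieces = []
# Only one chunk of the large file is in RAM at a time; with real data,
# append each merged piece to an output CSV instead of keeping a list.
for chunk in pd.read_csv(big_csv, chunksize=2):
    pieces.append(pd.merge(chunk, df2, on="column1", how="left"))

merged = pd.concat(pieces, ignore_index=True)
print(len(merged))  # 4: each left row matched exactly once here
```

With the question's data the chunk size would be something like a few hundred thousand rows, tuned to the available memory.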
How can I get this to work, even if it's hopelessly inefficient?

EDIT: In response to the suggestion of merging on two columns/indexes, I don't think I can do that. Here is the code I am trying to run:

import pandas as pd
import dask.dataframe as dd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
merged = merged[(ddf1.column1 == row.column1) & (ddf2.begin >= ddf1.begin) & (ddf2.begin <= ddf1.end)]
merged = dd.merge(ddf2, merged, on = ["column1"]).compute(num_workers=60)
merged.to_csv("output.csv", index=False)
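The filter in the edit suggests the real goal may be an interval lookup: keep the df2 rows whose `begin` falls inside a `[begin, end]` interval of a df1 row with the same `column1`. If so, pairing on `column1` and filtering immediately (per chunk or per class, so the intermediate product stays small) avoids ever materializing the full outer join. A minimal in-memory sketch with invented values:

```python
import pandas as pd

# Toy frames mirroring the question's columns.
df1 = pd.DataFrame({"column1": ["class1", "class1"],
                    "begin": [10001, 11505], "end": [10468, 11675],
                    "category": ["third", "fourth"]})
df2 = pd.DataFrame({"column1": ["class1", "class1"],
                    "begin": [10200, 11600]})

# Pair rows sharing column1; df1's clashing "begin" gets the "_repeat" suffix.
pairs = pd.merge(df2, df1, on="column1", suffixes=("", "_repeat"))

# Keep only df2 positions that land inside a df1 interval.
hits = pairs[(pairs["begin"] >= pairs["begin_repeat"]) &
             (pairs["begin"] <= pairs["end"])]
print(len(hits))  # 2 of the 4 candidate pairs survive
```

In the original code the filter runs after `.compute()` has already built the full product, which is exactly where the memory goes; filtering inside the loop over chunks (or using an interval-tree structure such as `pd.IntervalIndex`) keeps the peak footprint proportional to one chunk's matches.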
You can't merge the two dataframes on "column1" alone, because "column1" is not a unique identifier for each instance in either dataframe. Try:

merged = pd.merge(df1, df2, on=["column1", "begin"], how="outer", suffixes=("","_repeated"))
If you also have an "end" column in df2, you may need to try:

merged = pd.merge(df1, df2, on=["column1", "begin", "end"], how="outer", suffixes=("","_repeated"))
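To see what the compound key changes, here is a toy run (values invented) using `indicator=True`, which labels each output row by the side it came from; with the key `["column1", "begin"]` each key pair matches at most once, so the output stays near the size of the inputs instead of exploding:

```python
import pandas as pd

left = pd.DataFrame({"column1": ["class1", "class1"],
                     "begin": [10001, 10469], "end": [10468, 11447]})
right = pd.DataFrame({"column1": ["class1", "class1"],
                      "begin": [10001, 10524], "x": [1, 2]})

# Outer merge on the compound key; _merge shows each row's provenance.
merged = pd.merge(left, right, on=["column1", "begin"], how="outer",
                  suffixes=("", "_repeated"), indicator=True)
print(len(merged))  # 3 rows: one shared key, one left-only, one right-only
```

Note, though, that this only helps if rows that should match really share identical `begin` values; as a commenter points out below, it changes the semantics of the join the OP asked for.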


"Roughly 2 TB of RAM and hundreds of threads" -- wowsers. First, are you on Linux? If so, check ulimit and/or rlimit for the task.

@BrianCain Good idea. How would I do that, though? :) These dataframes aren't that big.

OK... after looking at your edit, your approach seems wrong, IMHO. Please explain what you are trying to do. It looks like you want to clip "merged" down to a specific set of rows. What is in the rows? I think you can solve this in a simpler way.

This doesn't answer the OP's question. The OP wants an outer join on "column1" and is getting a MemoryError.