Python MemoryError when merging two DataFrames with pandas and Dask: how can I do this?
I have two pandas DataFrames that I want to merge, but I keep running into a MemoryError. What workaround can I use? Here is the setup:
import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

print(df1.shape)  # output: (4757076, 4)
print(df2.shape)  # output: (428764, 45)

df1.head()
column1 begin end category
0 class1 10001 10468 third
1 class1 10469 11447 third
2 class1 11505 11675 fourth
3 class2 15265 15355 seventh
4 class2 15798 15849 second
df2.head()
column1 begin ....
0 class1 10524 ....
1 class1 10541 ....
2 class1 10549 ....
3 class1 10565 ...
4 class1 10596 ...
I just want to merge these two DataFrames on "column1", but this always results in a MemoryError.

Let's first try it in pandas, on a system with roughly 2 TB of RAM and hundreds of threads:
import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
merged = pd.merge(df1, df2, on="column1", how="outer", suffixes=("", "_repeated"))
Here's the error I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 160, in pandas.algos.full_outer_join (pandas/algos.c:61256)
MemoryError
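Before retrying, it helps to estimate how many rows the join would produce: each key value contributes one output row per matching (left, right) pair, so a key that repeats heavily on both sides makes the result explode. A minimal sketch of that estimate, on toy counts (the real frames would use the same `value_counts` trick on "column1"):

```python
import pandas as pd

# Toy stand-ins: "class1" repeats on both sides, so pairs multiply.
df1 = pd.DataFrame({"column1": ["class1"] * 3 + ["class2"] * 2})
df2 = pd.DataFrame({"column1": ["class1"] * 4 + ["class2"] * 1})

left = df1["column1"].value_counts()
right = df2["column1"].value_counts()

# Matched rows in the join: sum over shared keys of count1 * count2.
# (An outer join additionally keeps one row per unmatched key.)
pairs = (left * right).dropna().sum()
print(int(pairs))  # 3*4 + 2*1 = 14 matching pairs
```

With millions of rows sharing a handful of key values, this number alone can exceed what fits in memory, regardless of how much RAM the merge machinery itself needs.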
That didn't work. Let's try with Dask:
import pandas as pd
import dask.dataframe as dd
from numpy import nan
ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
Here's the error I get:
Traceback (most recent call last):
File "repeat_finder.py", line 15, in <module>
merged = dd.merge(ddf1, ddf2,on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
File "/path/python3.5/site-packages/dask/base.py", line 78, in compute
return compute(self, **kwargs)[0]
File "/path/python3.5/site-packages/dask/base.py", line 178, in compute
results = get(dsk, keys, **kwargs)
File "/path/python3.5/site-packages/dask/threaded.py", line 69, in get
**kwargs)
File "/path/python3.5/site-packages/dask/async.py", line 502, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "/path/python3.5/site-packages/dask/async.py", line 268, in execute_task
result = _execute_task(task, data)
File "/path/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/path/python3.5/site-packages/dask/dataframe/methods.py", line 221, in merge
suffixes=suffixes, indicator=indicator)
File "/path/python3.5/site-packages/pandas/tools/merge.py", line 59, in merge
return op.get_result()
File "/path/python3.5/site-packages/pandas/tools/merge.py", line 503, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/path/python3.5/site-packages/pandas/tools/merge.py", line 667, in _get_join_info
right_indexer) = self._get_join_indexers()
File "/path/python3.5/site-packages/pandas/tools/merge.py", line 647, in _get_join_indexers
how=self.how)
File "/path/python3.5/site-packages/pandas/tools/merge.py", line 876, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 226, in pandas._join.full_outer_join (pandas/src/join.c:11286)
File "pandas/src/join.pyx", line 231, in pandas._join._get_result_indexer (pandas/src/join.c:11474)
File "path/python3.5/site-packages/pandas/core/algorithms.py", line 1072, in take_nd
out = np.empty(out_shape, dtype=dtype, order='F')
How can I get this to work, even if it's hopelessly inefficient?

Edit: In response to the suggestion about merging on two columns/indexes, I don't think I can do that. Here is the code I'm trying to run:
import pandas as pd
import dask.dataframe as dd
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
merged = merged[(ddf1.column1 == row.column1) & (ddf2.begin >= ddf1.begin) & (ddf2.begin <= ddf1.end)]
merged = dd.merge(ddf2, merged, on = ["column1"]).compute(num_workers=60)
merged.to_csv("output.csv", index=False)
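Reading between the lines of the edit, the goal seems to be an interval-overlap filter: keep the df2 rows whose `begin` falls inside a `[begin, end]` interval of df1 for the same `column1`. If so, a pandas-only sketch (toy data standing in for the real CSVs, and `row.column1` dropped since `row` is never defined) can do the range filter right after an inner merge, avoiding the outer join's unmatched rows:

```python
import pandas as pd

# Toy stand-ins for the two CSVs, shaped like the question's samples.
df1 = pd.DataFrame({
    "column1":  ["class1", "class1", "class2"],
    "begin":    [10001, 11505, 15265],
    "end":      [10468, 11675, 15355],
    "category": ["third", "fourth", "seventh"],
})
df2 = pd.DataFrame({
    "column1": ["class1", "class1", "class2"],
    "begin":   [10524, 11550, 15300],
})

# Inner merge on column1; "begin" overlaps, so df1's copy gets the
# suffix ("begin_repeat") while df1's "end" keeps its name.
merged = df2.merge(df1, on="column1", suffixes=("", "_repeat"))

# Keep only df2 rows whose begin lies inside a df1 interval.
merged = merged[(merged["begin"] >= merged["begin_repeat"]) &
                (merged["begin"] <= merged["end"])]
```

This is still quadratic per key value, so for the real data it may need to be done one `column1` group at a time, but it sidesteps building the full outer-join result first.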
You can't merge the two DataFrames on column1 alone, because column1 is not a unique identifier for each row in either DataFrame. Try:
merged = pd.merge(df1, df2, on=["column1", "begin"], how="outer", suffixes=("","_repeated"))
If there is also an end column in df2, you may want to try:
merged = pd.merge(df1, df2, on=["column1", "begin", "end"], how="outer", suffixes=("","_repeated"))
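To see why the extra key columns help, here is a toy comparison (made-up values) of the output sizes when the join key is non-unique versus (nearly) unique:

```python
import pandas as pd

# column1 repeats, but (column1, begin) is unique in each frame.
df1 = pd.DataFrame({"column1": ["class1", "class1"],
                    "begin":   [10001, 11505],
                    "end":     [10468, 11675]})
df2 = pd.DataFrame({"column1": ["class1", "class1"],
                    "begin":   [10001, 99999]})

wide = pd.merge(df1, df2, on="column1", how="outer",
                suffixes=("", "_repeated"))
narrow = pd.merge(df1, df2, on=["column1", "begin"], how="outer",
                  suffixes=("", "_repeated"))

print(len(wide))    # 4: every df1 row pairs with every df2 row (2 * 2)
print(len(narrow))  # 3: one exact match plus one unmatched row per side
```

On the real data, shrinking the per-key match multiplicity this way is what keeps the output (and the join indexers) small enough to fit in memory.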
Comments:

"About 2 TB of RAM and hundreds of threads" - wowsers. First, are you on Linux? If so, check the ulimit and/or rlimit for the task.

@BrianCain Good idea. How would I do that, though? :) These DataFrames aren't that big...

OK, after looking at your edit, your approach seems wrong, IMHO. Please explain what you intend to do. It looks like you want to clip the merged result down to a specific set of rows. What is in those rows? I think you can solve this in a simpler way.

This doesn't answer the OP's question. The OP wants an outer join on "column1" and gets a MemoryError.