
Python: merging two large dataframes with the Dask library


I am very new to Dask. I am trying to merge two dataframes: one comes from a small file that would fit in a pandas dataframe, but for convenience I use it as a dask dataframe too; the other is very large. I try to save the result to a CSV file, since I know it may not fit in memory as a dataframe.

import pandas as pd
import dask.dataframe as dd

AF = dd.read_csv("../data/AuthorFieldOfStudy.csv")
AF.columns = ['AID', 'FID']

# extract a subset of authors to reduce the size of the final merge
AF = AF.loc[AF['FID'] == '0271BC14']

# this is a large file (9 MB)
PAA = dd.read_csv("../data/PAA.csv")
PAA.columns = ['PID', 'AID', 'AffID']

result = dd.merge(AF, PAA, on='AID')

# note: in this dask version to_csv computes by default,
# so the trailing .compute() is redundant
result.to_csv("../data/CompSciPaperAuthorAffiliations.csv").compute()
I get the following error, which I don't quite understand:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-6b2f889f44ff> in <module>()
     14 result = dd.merge(AF,PAA, on='AID')
     15 
---> 16 result.to_csv("../data/CompSciPaperAuthorAffiliations.csv").compute()

/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.pyc in to_csv(self, filename, **kwargs)
    936         """ See dd.to_csv docstring for more information """
    937         from .io import to_csv
--> 938         return to_csv(self, filename, **kwargs)
    939 
    940     def to_delayed(self):

/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in to_csv(df, filename, name_function, compression, compute, get, **kwargs)
    411     if compute:
    412         from dask import compute
--> 413         compute(*values, get=get)
    414     else:
    415         return values

/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(*args, **kwargs)
    177         dsk = merge(var.dask for var in variables)
    178     keys = [var._keys() for var in variables]
--> 179     results = get(dsk, keys, **kwargs)
    180 
    181     results_iter = iter(results)

/usr/local/lib/python2.7/dist-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
     74     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     75                         cache=cache, get_id=_thread_get_id,
---> 76                         **kwargs)
     77 
     78     # Cleanup pools associated to dead threads

/usr/local/lib/python2.7/dist-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, dumps, loads, **kwargs)
    491                     _execute_task(task, data)  # Re-execute locally
    492                 else:
--> 493                     raise(remote_exception(res, tb))
    494             state['cache'][key] = res
    495             finish_task(dsk, key, state, results, keyorder.get)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 14: ordinal not in range(128)

Traceback
---------
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 268, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 329, in collect
    res = p.get(part)
  File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
    return self.get([keys], **kwargs)[0]
  File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
    return self._get(keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
    for chunk in raw]
  File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 144, in deserialize
    for block, dt, shape in zip(b_blocks, dtypes, shapes)]
  File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 127, in deserialize
    l = decode(l)
  File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 114, in decode
    return list(map(decode, o))
  File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 110, in decode
    return [item.decode() for item in o]
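
The last frame is the informative one: partd calls item.decode() with no codec, and in Python 2 a bare str.decode() defaults to the ascii codec, so any byte >= 0x80 in the shuffled strings raises exactly this error. A minimal illustration under Python 2 (the byte string here is a made-up example, not taken from the data):

# Python 2 only: str.decode() with no argument uses the ascii codec
>>> 'Andr\xc5\xaas'.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 4: ordinal not in range(128)
>>> 'Andr\xc5\xaas'.decode('utf-8')  # an explicit codec succeeds
u'Andr\u016as'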

Can you try

pip install partd --upgrade
pip install dask --upgrade

and see if that resolves the problem?

@MRocklin I just installed partd and dask, so they should already be up to date. To double-check, though, I ran the commands above and got:

Requirement already up-to-date:
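
To confirm which versions the interpreter actually picks up, one can check from the command line (a side note, not from the original thread, assuming both packages expose a __version__ attribute, which they normally do):

python -c "import dask, partd; print(dask.__version__, partd.__version__)"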
In that case, could you also open an issue? I suspect this problem does not occur in Python 3.
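
In the meantime, one possible workaround is to decode the text columns to unicode up front, so the Python 2 shuffle machinery never handles raw non-ASCII bytes. A minimal sketch, assuming the offending bytes are UTF-8 text in the string columns (the to_unicode helper is illustrative, not from the thread):

import dask.dataframe as dd

def to_unicode(df):
    # decode every byte-string (object) column in a partition as UTF-8
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].str.decode('utf-8')
    return df

AF = dd.read_csv("../data/AuthorFieldOfStudy.csv")
AF.columns = ['AID', 'FID']
AF = AF.map_partitions(to_unicode)

PAA = dd.read_csv("../data/PAA.csv")
PAA.columns = ['PID', 'AID', 'AffID']
PAA = PAA.map_partitions(to_unicode)

result = dd.merge(AF, PAA, on='AID')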