
Dask: outer join when reading from multiple CSV files

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import delayed

# Two frames that share column 'a' but otherwise differ ('b' vs. 'c')
df1 = pd.DataFrame({'a': np.arange(10), 'b': np.random.rand()})
df1 = df1.astype({'a': np.float64})
df2 = pd.DataFrame({'a': np.random.rand(5), 'c': 1})
df1.to_csv('df1.csv')
df2.to_csv('df2.csv')
dd.read_csv('*.csv').compute()

gives an inner-join result.

And:

# Wrap the in-memory frames as delayed objects, one partition each
df1_delayed = delayed(lambda: df1)()
df2_delayed = delayed(lambda: df2)()
dd.from_delayed([df1_delayed, df2_delayed]).compute()

gives an outer-join result.

How can I make read_csv work in the same mode?

Edit:

Even passing a dtype schema down to pandas does not work:

          a         b    c
0  0.000000  0.218319  NaN
1  1.000000  0.218319  NaN
2  2.000000  0.218319  NaN
...

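In general, dask.dataframe assumes that all of the pandas DataFrames that make up a dask.dataframe have the same columns and dtypes. If that is not the case, the behavior is ill-defined.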
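One way to see whether a set of CSV files violates that assumption is to compare their header rows before handing them to dask. The following is only a sketch using pandas; the helper names column_sets and same_schema are made up for illustration and are not part of dask or pandas.

```python
import os
import tempfile

import pandas as pd


def column_sets(paths):
    """Map each CSV path to the tuple of column names in its header row."""
    # nrows=0 parses only the header, so no data rows are loaded
    return {p: tuple(pd.read_csv(p, nrows=0).columns) for p in paths}


def same_schema(paths):
    """Return True when every file has the same column names in the same order."""
    return len(set(column_sets(paths).values())) <= 1


# Demo: two small files that share 'a' but differ in 'b' vs. 'c'
tmpdir = tempfile.mkdtemp()
p1 = os.path.join(tmpdir, "df1.csv")
p2 = os.path.join(tmpdir, "df2.csv")
pd.DataFrame({"a": [0.0, 1.0], "b": [0.5, 0.5]}).to_csv(p1, index=False)
pd.DataFrame({"a": [0.3], "c": [1]}).to_csv(p2, index=False)

mismatch = not same_schema([p1, p2])
```

This checks column names only; dtypes would still need to be compared separately.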

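If your CSVs have different columns and dtypes, I recommend using dask.delayed, as you did in your second example, and explicitly adding the new empty columns before calling dask.dataframe.from_delayed.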