Why does Dask compute a DataFrame built with from_pandas faster than one read directly with Dask?

python python-3.x pandas dask

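I ran the same dataset through Dask in two different ways. One of them turned out to be almost 10 times faster than the other. I tried to find the reason, but without success.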

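1. Entirely with Dask (dd.read_csv): 9.77 s
2. pandas + Dask (pd.read_csv, then dd.from_pandas): 1.31 s

Does anyone know why? Is there another, faster way to do it?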

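Note that dd.read_csv(...) does not actually read the data. It only adds a step to the computation graph. Only when compute is called is the whole graph built so far actually executed, including the reading of both DataFrames.

So in the first variant, the timed operation includes reading both DataFrames, repartitioning them, and finally the merge itself. In the second variant the timing covers something different: both DataFrames have already been read, so the timed operation includes only the partitioning and the merge.

The source DataFrames are evidently large, and reading them takes a considerable amount of time, which the second measurement does not count.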

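Try another test: write a function that reads both DataFrames with pd.read_csv(...) and then performs the remaining steps (partitioning and merge). Then measure the execution time of that function.

I expect the execution time may even be longer than in the first variant, because in the first variant the two DataFrames are read concurrently (on different cores), whereas in the test proposed above they are read sequentially.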

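Yes, you are right, thank you. As for the pandas merge, I was really surprised: the merge operation takes only 463 ms, which makes it the fastest solution for merging nearly 2M rows. I suppose that is because I merge on the index...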
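For comparison, a plain pandas merge on the index can be sketched as follows. The in-memory frames and the row count are assumptions chosen so the snippet runs without the original files. It is not a reproduction of the 463 ms measurement.

```python
import time

import pandas as pd

# Hypothetical in-memory frames standing in for the two parsed files.
n_rows = 100_000
pd_english = pd.DataFrame({"ingles": [f"sentence {i}" for i in range(n_rows)]})
pd_spanish = pd.DataFrame({"espanol": [f"frase {i}" for i in range(n_rows)]})

# Merge row by row on the default integer index, timing only the merge.
start = time.perf_counter()
total_pandas = pd.merge(pd_english, pd_spanish, left_index=True, right_index=True)
elapsed = time.perf_counter() - start
print(f"{len(total_pandas)} rows merged in {elapsed:.3f} s")
```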

# Variant 1: entirely with Dask
import dask.dataframe as dd
from multiprocessing import cpu_count

# Count the number of cores
cores = cpu_count()

# Lazily read each file, then repartition by the number of cores
english = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                      sep='\r', header=None, names=['ingles'], dtype={'ingles': str})
english = english.repartition(npartitions=cores)
spanish = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                      sep='\r', header=None, names=['espanol'], dtype={'espanol': str})
spanish = spanish.repartition(npartitions=cores)

# Compute: this triggers the reads, the repartitioning, and the merge
%time total_dd = dd.merge(english, spanish, left_index=True, right_index=True).compute()

Out: 9.77 s

# Variant 2: pandas + Dask
import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

# Count the number of cores
cores = cpu_count()

# Read both files eagerly with pandas (this happens outside the timed step)
pd_english = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                         sep='\r', header=None, names=['ingles'])
pd_spanish = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                         sep='\r', header=None, names=['espanol'])

# Split the in-memory frames into one partition per core
english_pd = dd.from_pandas(pd_english, npartitions=cores)
spanish_pd = dd.from_pandas(pd_spanish, npartitions=cores)

# Compute: only the merge of the already loaded data is timed
%time total_pd = dd.merge(english_pd, spanish_pd, left_index=True, right_index=True).compute()

Out: 1.31 s