Problems when computing/merging Dask dataframes when the index is categorical


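I am trying to use Dask to process a dataset that does not fit in memory. It is time-series data for various "IDs". After reading the Dask documentation, I chose the Parquet file format, partitioned by "ID".

However, when reading the Parquet data and setting the index, I ran into "TypeError: to union ordered Categoricals, all categories must be the same", which I could not solve on my own.

This code reproduces the problem: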

import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback

# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]

# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize  to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)

try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())

At this point, I get the following error:

~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
   1492         if not self.ordered:
   1493             raise TypeError(
-> 1494                 f"Categorical is not ordered for operation {op}\n"
   1495                 "you can use .as_ordered() to change the "
   1496                 "Categorical to an ordered one\n"

TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
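
The same message can be reproduced with pandas alone, presumably because set_index needs the minimum and maximum of the new index column and the partition column comes back as an unordered categorical. A minimal sketch, assuming made-up sample values (only the "ID" labels come from the code above):

```python
import pandas as pd

# Hypothetical sample values; only the labels come from the question.
ids = pd.Series(["BBB", "AAA", "DDD", "CCC"], dtype="category")

try:
    ids.max()  # unordered categoricals do not support min/max
    unordered_error = None
except TypeError as err:
    unordered_error = err

# After as_ordered(), values compare in the order of the categories
# (alphabetical here, because pandas sorted them on creation).
ordered_max = ids.cat.as_ordered().max()
print(type(unordered_error).__name__, ordered_max)
```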

I then did:

# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")

And when I then try to call df.compute(scheduler="processes"), I get the TypeError mentioned earlier:

try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except Exception:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())

This gives:

Traceback (most recent call last):
  File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
    aa = df.compute(scheduler=schd_str)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
    return _concat(results)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
    else methods.concat(args2, uniform=True, ignore_index=ignore_index)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
    ignore_index=ignore_index,
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
    ind = concat([df.index for df in dfs])
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
    ignore_index=ignore_index,
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
    return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
    raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
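
The last frames of this traceback are in pandas' union_categoricals, which refuses to combine categoricals whose ordered flags differ. That would be consistent with some partitions reaching the final concat with an ordered index and others with an unordered one, although I have not confirmed why this happens only with the "processes" scheduler. A pandas-only sketch of the check, using two hand-built pieces that stand in for partition indexes (the variable names are hypothetical):

```python
import pandas as pd
from pandas.api.types import union_categoricals

cats = ["AAA", "BBB", "CCC", "DDD"]
# Stand-ins for two partition indexes: same categories, different ordered flag.
ordered_part = pd.Categorical(["AAA", "BBB"], categories=cats, ordered=True)
unordered_part = pd.Categorical(["CCC", "DDD"], categories=cats, ordered=False)

try:
    union_categoricals([ordered_part, unordered_part])
    mixed_error = None
except TypeError as err:
    mixed_error = err

# With the same ordered flag on both pieces, the union goes through.
combined = union_categoricals([ordered_part, unordered_part.as_ordered()])
print(type(mixed_error).__name__, list(combined), combined.ordered)
```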


Surprisingly, using df.compute(scheduler="threads"), using df.compute(scheduler="synchronous"), or not setting the index at all works fine.

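However, skipping the index does not seem like the right approach, because I am actually trying to merge several of these datasets and expected that setting the index would make the merge faster than leaving it unset. (I get the same error when I try to merge two dataframes indexed this way.)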

I tried inspecting df._meta, and it turns out that my categories are "known", as they should be.

I also read about something that looks similar, but somehow did not find a solution there.

Thanks for your help,

Interesting. I suggest opening an issue for this.