Python Dask get_dummies does not convert variables

I am trying to use `get_dummies` via `dask`, but it does not convert my variables, and it does not raise an error either:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical) 
>>> daskDataDummies.head()
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
>>> daskDataDummies.compute() 
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
5      F
6      M
7      F
8      M
9      F
>>>
The `pandas` equivalent (run in a new terminal, just in case) is:


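My understanding is that this should work, but does the data need to be pulled into `pandas` first? If so, that defeats the purpose of using Dask, because my dataset (~500GB) will not fit in a `pandas` dataframe. Am I misreading something? TIA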

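You need to convert your string columns to `Categorical` before trying to use `get_dummies`. A `dask.dataframe.get_dummies` was added which, unlike `pd.get_dummies`, raises an error if you try to pass an `object` (string) column.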

To get categoricals, you can either use `.categorize` before `dd.get_dummies` or, with pandas>=0.19, use the `dtype` keyword in `read_csv`, like

df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})
Here is a small example:

In [2]: import dask.dataframe as dd

In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)

In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[4]:
   A
0  a
1  b
2  a

In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)

/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
     68         if columns is None:
     69             if (data.dtypes == 'object').any():
---> 70                 raise NotImplementedError(not_cat_msg)
     71             columns = data._meta.select_dtypes(include=['category']).columns
     72         else:

NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.

In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
   A_a  A_b  A_c
0    1    0    0
1    0    1    0
2    1    0    0
3    0    1    0
4    0    0    1

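Dask requires categoricals for `get_dummies` because it needs to know up front all of the new dummy variables that have to be created. pandas does not have to worry about this, since all of your data is already in memory.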
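To make that concrete, here is a minimal sketch in plain pandas (no Dask), under the assumption that two small hand-made frames can stand in for two partitions of a larger dataset. The names `part1`, `part2`, and `cat_dtype` are illustrative only. It shows that encoding each piece on its own gives mismatched columns, while declaring the full set of categories first gives every piece the same columns:

```python
import pandas as pd

# Two hypothetical "partitions" of the same column; neither sees every value.
part1 = pd.DataFrame({"A": ["a", "b", "a"]})
part2 = pd.DataFrame({"A": ["b", "c"]})

# Encoding each partition independently yields different column sets.
print(list(pd.get_dummies(part1).columns))  # ['A_a', 'A_b']
print(list(pd.get_dummies(part2).columns))  # ['A_b', 'A_c']

# With the full category set declared up front, every partition produces
# the same columns, so the encoded pieces line up when stacked.
cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
dummies1 = pd.get_dummies(part1.astype({"A": cat_dtype}))
dummies2 = pd.get_dummies(part2.astype({"A": cat_dtype}))
combined = pd.concat([dummies1, dummies2], ignore_index=True)
print(list(combined.columns))  # ['A_a', 'A_b', 'A_c']
```

Finding that full category set requires a pass over all of the data, which is presumably why Dask asks for it as an explicit step rather than doing it implicitly.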

You don't need the list to select columns: `df_d[['gender']]` --> `df_d['gender']`
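The real example (existing code) has 200+ variables.

Thanks for the reply, that makes sense. While your example runs, adding `.categorize()` to my example gives me: Traceback (most recent call last): File "", line 1, in AttributeError: 'Series' object has no attribute 'categorize'

You should be able to do `dd.get_dummies(data.to_frame().categorize())`

Sorry, that gives me this error: raise AttributeError("'DataFrame' object has no attribute %r" % key) AttributeError: 'DataFrame' object has no attribute 'to_frame'

Are you sure you are running the same code in both places? Your first comment fails because you have a Series, not a DataFrame; your second comment fails because you already have a DataFrame.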