Python Dask get_dummies does not convert variables
python, pandas, dask, dummy-variable
I am trying to use get_dummies via dask, but it does not convert my variable, nor does it raise an error:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
uid gender
0 1 M
1 2 NaN
2 3 NaN
3 4 F
4 5 NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical)
>>> daskDataDummies.head()
gender
0 M
1 NaN
2 NaN
3 F
4 NaN
>>> daskDataDummies.compute()
gender
0 M
1 NaN
2 NaN
3 F
4 NaN
5 F
6 M
7 F
8 M
9 F
>>>
I also ran the pandas equivalent in a new terminal, just in case.
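For comparison, here is a minimal sketch of what a pandas equivalent looks like on a small in-memory frame. The rows are made up to mirror the head() output above; they are not the actual CSV, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up rows mirroring the head() output above; not the real CSV.
df = pd.DataFrame({"uid": [1, 2, 3, 4, 5],
                   "gender": ["M", None, None, "F", None]})

# pandas holds every value in memory, so it can discover the distinct
# values of the column and create one indicator column per value.
# Missing values get no column of their own by default.
dummies = pd.get_dummies(df[["gender"]])
print(dummies.columns.tolist())
```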
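My understanding is that it should work, but does the data need to be pulled into pandas first? If so, that defeats the purpose of using dask, since my dataset (~500GB) will not fit in a pandas dataframe. Am I misreading something? TIA

You need to convert the string columns to Categorical before trying to use get_dummies. A dask.dataframe.get_dummies was added which, unlike pd.get_dummies, raises an error if you try to pass object (string) columns.
To get categoricals, you can use .categorize before dd.get_dummies or, with pandas >= 0.19, use the dtype keyword in read_csv, like: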
df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})
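As a rough illustration of the dtype keyword, the sketch below uses pandas alone on inline CSV text (made-up rows, not the real file); as far as I know, dd.read_csv forwards dtype to pandas.read_csv in the same way. Spelling out the categories with pd.CategoricalDtype is an assumption on my part and not part of the answer above. With the plain string "category", dask may not know the categories until it has read the data, in which case an explicit list like this (or .categorize()) is one way to supply them.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file.
csv_text = "uid,gender\n1,M\n2,\n3,\n4,F\n5,\n"

# Hypothetical: we assume the only valid values are "F" and "M".
gender_dtype = pd.CategoricalDtype(categories=["F", "M"])

# Empty fields are read as missing values; the column is categorical
# from the start, with no separate conversion step.
df = pd.read_csv(io.StringIO(csv_text), dtype={"gender": gender_dtype})
print(df["gender"].dtype)
print(df["gender"].cat.categories.tolist())
```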
Here is a small example:
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)
In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
warnings.warn(msg.format(n, len(r)))
Out[4]:
A
0 a
1 b
2 a
In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
68 if columns is None:
69 if (data.dtypes == 'object').any():
---> 70 raise NotImplementedError(not_cat_msg)
71 columns = data._meta.select_dtypes(include=['category']).columns
72 else:
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.
In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
A_a A_b A_c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
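Dask needs the columns to be categorical for get_dummies because it has to know every new dummy variable that must be created. pandas does not have to worry about this, because all of your data is already in memory.
You don't need a list to select the column: df_d[['gender']] --> df_d['gender']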
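To illustrate why the full set of categories has to be known up front, here is a small pandas-only sketch. The two chunks stand in for two partitions of a larger dataset; the data and names are made up for illustration.

```python
import pandas as pd

# Two chunks standing in for two partitions of a larger dataset.
chunk1 = pd.DataFrame({"A": ["a", "b", "a"]})
chunk2 = pd.DataFrame({"A": ["b", "c"]})

# Encoding each chunk on its own gives mismatched columns.
print(pd.get_dummies(chunk1).columns.tolist())  # ['A_a', 'A_b']
print(pd.get_dummies(chunk2).columns.tolist())  # ['A_b', 'A_c']

# With the categories fixed up front, every chunk yields the same columns,
# including columns for categories that a given chunk never contains.
cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
fixed1 = pd.get_dummies(chunk1.astype({"A": cat_dtype}))
fixed2 = pd.get_dummies(chunk2.astype({"A": cat_dtype}))
print(fixed1.columns.tolist())  # ['A_a', 'A_b', 'A_c']
print(fixed2.columns.tolist())  # ['A_a', 'A_b', 'A_c']
```

This is the situation dask is in with partitioned data: without known categories, no single partition can tell which columns the final result should have.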
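The real example (existing code) has 200+ variables. Thanks for the reply, that makes sense. While your example runs, adding .categorize() to my example gives me: Traceback (most recent call last): File "", line 1, in AttributeError: 'Series' object has no attribute 'categorize'
You should be able to do dd.get_dummies(data.to_frame().categorize())
Sorry, that gives me this error: raise AttributeError("'DataFrame' object has no attribute %r" % key) AttributeError: 'DataFrame' object has no attribute 'to_frame'
Are you sure you are running the same code in both places? Your first comment fails because you have a Series, not a DataFrame; your second comment fails because you already have a DataFrame.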