Python Dask Client.map returns a KeyError on a Dask dataframe


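I am trying to use Python Dask to create an updated version of the random forest classification example, as originally described

When I try to pass the training set to the Client.map function, it throws a KeyError, and from the error message I can't tell what I'm doing wrong.

Here is what I have: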

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from distributed import Client, progress, wait
c = Client('127.0.0.1:8786')
c

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude', 
           'dropoff_longitude', 'dropoff_latitude', 'payment_type', 
           'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

import dask.dataframe as dd

dfs = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv', 
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'anon': True})
dfs = c.persist(dfs)
progress(dfs)

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est

train, test = dfs.random_split([0.7, 0.3])

estimators = c.map(fit, train)
progress(estimators, complete=False)

This raises the following error:

KeyError                                  Traceback (most recent call last)
/opt/anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-61-9846f819ffca> in <module>()
      8 train, test = dfs.random_split([0.7, 0.3])
      9 
---> 10 estimators = c.map(fit, train)
     11 progress(estimators, complete=False)

/opt/anaconda/lib/python3.5/site-packages/distributed/client.py in map(self, func, *iterables, **kwargs)
   1243             raise ValueError("Only use allow_other_workers= if using workers=")
   1244 
-> 1245         iterables = list(zip(*zip(*iterables)))
   1246         if isinstance(key, list):
   1247             keys = key

/opt/anaconda/lib/python3.5/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   2284 
   2285             # error is raised from pandas
-> 2286             meta = self._meta[_extract_meta(key)]
   2287             dsk = dict(((name, i), (operator.getitem, (self._name, i), key))
   2288                        for i in range(self.npartitions))

/opt/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

/opt/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

/opt/anaconda/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

/opt/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/anaconda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

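According to the error output, the error is triggered by the statement estimators = c.map(fit, train). This suggests that def fit(df): may need to be modified so that the dask dataframe is passed to est.fit() correctly, but I'm not sure how to do that.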
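
The traceback is consistent with a different reading: the failure happens at iterables = list(zip(*zip(*iterables))) inside Client.map, before fit is ever called. When zip iterates over an object that supports obj[key] but has no __iter__, Python falls back to calling obj[0], obj[1], and so on, and a column lookup for the label 0 raises KeyError: 0. The sketch below reproduces that mechanism in plain Python; ColumnIndexed is a hypothetical stand-in, not the dask dataframe class.

```python
class ColumnIndexed:
    """Hypothetical stand-in: supports obj[column] but defines no __iter__."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # Column-style lookup: an unknown label raises KeyError.
        return self._data[key]


frame = ColumnIndexed({"fare_amount": [7.5, 12.0], "tip_amount": [1.0, 2.5]})

error = None
try:
    # zip() calls iter(frame); with no __iter__, Python tries frame[0] first.
    list(zip(frame))
except KeyError as exc:
    error = exc

print(repr(error))
```

Under this reading, fit itself is fine when it receives a pandas dataframe; the problem is what gets handed to Client.map.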

I'm not sure that you can pass a dask dataframe to scikit-learn. Have you tried using dask_ml, as described here