Python Dask DataFrame: read_csv not respecting dtype


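I am getting an error when trying to read a large file with some mixed-type columns into a dask.dataframe. To confirm, '22' and '32' are the actual column names, which is why they are given as strings.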

import dask.dataframe as dd

df = dd.read_csv('s3://myfile.csv', encoding='latin', sample=250000000,
                 dtype={'22': str, '32': str})

df = df.compute()
This returns:

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `pandas.core.frame.DataFrame`
+--------------------+---------+----------+
| Column             | Found   | Expected |
+--------------------+---------+----------+
| 22                 | int64   | object   |
| 32                 | float64 | object   |
+--------------------+---------+----------+
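I am not sure why read_csv is not interpreting everything in these columns as strings. Even if the sample shows the columns in integer/float format, it should still be able to read them as strings given the explicit dtype.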
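One quick sanity check, independent of Dask, is to confirm that the keys passed in `dtype` match the header names character for character (no stray spaces, for example). The sketch below is only an illustration and is not known to be the cause of the error above: `missing_dtype_keys` is a hypothetical helper, the sample header is made up, and it assumes the first line of the file can be read locally as text.

```python
import csv
import io


def missing_dtype_keys(text_stream, dtype_keys):
    """Return the dtype keys that do not exactly match a header name.

    Hypothetical helper: reads only the first row of the CSV.
    """
    header = next(csv.reader(text_stream))  # first row holds the column names
    names = set(header)
    return [key for key in dtype_keys if key not in names]


# Made-up header with a stray space in front of 32.
sample = io.StringIO("id,22, 32\n1,5,0.5\n")
print(missing_dtype_keys(sample, ["22", "32"]))  # prints ['32']
```

An empty list means every key appears in the header exactly as written, in which case the mismatch has to come from somewhere else.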

Thanks

Longer error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-2e7a5a158ac4> in <module>()
----> 1 df = df.compute()

~/anaconda3/envs/python3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    164         dask.base.compute
    165         
--> 166         (result,) = compute(self, traverse=False, **kwargs)
    167         return result
    168 

~/anaconda3/envs/python3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    435     keys = [x.__dask_keys__() for x in collections]
    436     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 437     results = schedule(dsk, keys, **kwargs)
    438     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    439 

~/anaconda3/envs/python3/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2593                     should_rejoin = False
   2594             try:
-> 2595                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2596             finally:
   2597                 for f in futures.values():

~/anaconda3/envs/python3/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1891                 direct=direct,
   1892                 local_worker=local_worker,
-> 1893                 asynchronous=asynchronous,
   1894             )
   1895 

~/anaconda3/envs/python3/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    778         else:
    779             return sync(
--> 780                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    781             )
    782 

~/anaconda3/envs/python3/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    346     if error[0]:
    347         typ, exc, tb = error[0]
--> 348         raise exc.with_traceback(tb)
    349     else:
    350         return result[0]

~/anaconda3/envs/python3/lib/python3.6/site-packages/distributed/utils.py in f()
    330             if callback_timeout is not None:
    331                 future = asyncio.wait_for(future, callback_timeout)
--> 332             result[0] = yield future
    333         except Exception as exc:
    334             error[0] = sys.exc_info()

~/anaconda3/envs/python3/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

~/anaconda3/envs/python3/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1750                             exc = CancelledError(key)
   1751                         else:
-> 1752                             raise exception.with_traceback(traceback)
   1753                         raise exc
   1754                     if errors == "skip":

~/anaconda3/envs/python3/lib/python3.6/site-packages/dask/dataframe/utils.py in check_meta()
    663     raise ValueError(
    664         "Metadata mismatch found%s.\n\n"
--> 665         "%s" % ((" in `%s`" % funcname if funcname else ""), errmsg)
    666     )
    667 

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `pandas.core.frame.DataFrame`
+--------------------+---------+----------+
| Column             | Found   | Expected |
+--------------------+---------+----------+
| 22                 | int64   | object   |
| 32                 | float64 | object   |
+--------------------+---------+----------+

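Is this the complete error message? Also, please provide a minimal reproducible example.

Thanks for the reply, @AMC. That was not the complete message, but I have updated the question to include it. As for a minimal reproducible example, I am not sure how to adapt the current information other than by creating a dummy dataset that runs into a similar problem. I will work on that over the weekend and hope to post an update then.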
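A dummy dataset along the lines mentioned in the comment could be sketched as follows. Everything here is an assumption made for illustration: the helper names `write_dummy_csv` and `read_as_strings`, the `id` column, the row values, and the small `blocksize`. The file is built so that columns '22' and '32' look numeric in the first half and contain text only later. It may or may not reproduce the metadata mismatch, depending on the Dask and pandas versions in use.

```python
import csv
import os
import tempfile


def write_dummy_csv(path, n_rows=1000):
    """Write a CSV whose columns '22' and '32' look numeric at first
    and only contain text in the second half of the file."""
    with open(path, "w", newline="", encoding="latin-1") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "22", "32"])
        for i in range(n_rows):
            if i < n_rows // 2:
                writer.writerow([i, i, i / 2])  # looks like int / float
            else:
                writer.writerow([i, f"code{i}", "unknown"])  # plain text
    return path


def read_as_strings(path):
    """Read the dummy file with the same dtype mapping as in the question."""
    import dask.dataframe as dd  # imported here so the writer works without Dask

    return dd.read_csv(
        path,
        encoding="latin",
        blocksize=4096,  # small blocks so the file splits into several partitions
        dtype={"22": str, "32": str},
    )


dummy_path = write_dummy_csv(os.path.join(tempfile.mkdtemp(), "dummy.csv"))
# With Dask installed, the read can then be tried with:
# ddf = read_as_strings(dummy_path)
# print(ddf.compute().dtypes)
```

Keeping the writer free of third-party imports makes the dataset easy to share alongside the question, and the same file can be uploaded to S3 to test the original `s3://` path.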