Python 3.x: Finding and assigning column types in a Dask dataframe

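I am currently working with a pandas DataFrame. I iterate over the rows of each column, count how many values of each data type it contains, and assign the majority type to that column. Suppose I have a dataframe like this: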

    column1   column2   column3
0   1.43816      lots    1.9837
1  -0.28378        of   0.01758
2  0.552564    string  0.257276
3     dummy    inthis  -1.34906
4    string    column   1.33308
5  0.944862 -0.657849    dadada

My code looks like this (a working example without Dask):
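A minimal pandas-only sketch of that approach, assuming it mirrors the is_number function in the Dask version further down (this is a reconstruction under that assumption, not the original snippet). It uses plain int/float and NumPy's abstract scalar types for the checks, since the np.int and np.float aliases are no longer available in recent NumPy releases.

```python
import numpy as np
import pandas as pd


def is_number(column, column_length):
    """Convert the column to numeric if at least 51% of its values are numbers."""
    count = 0
    for value in column:
        # bool is a subclass of int, so skip booleans explicitly.
        if isinstance(value, (bool, np.bool_)):
            continue
        if isinstance(value, (int, float, np.integer, np.floating)):
            count += 1
    if count >= column_length * 0.51:
        # Values that cannot be parsed become NaN.
        column = pd.to_numeric(column, errors="coerce")
    return column


df = pd.DataFrame({
    "column1": [1.438161, -0.283780, 0.552564, "dummy", "string", 0.944862],
    "column2": ["lots", "of", "string", "inthis", "column", -0.657849],
    "column3": [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, "dadada"],
})

for name in df.columns:
    df[name] = is_number(df[name], len(df[name]))

print(df.dtypes)
```

With the sample data, column1 and column3 hold mostly numbers and are converted, while column2 stays an object column.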

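Since my actual data is large, I want to use Dask to add some scalability and to reduce memory usage, both by discarding data once a column's type has been determined and by not loading the whole dataset into memory (and, ideally, to speed up processing as well). However, when I try to iterate over the rows of a Dask dataframe, it raises an error:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
at the line
for row in column:
Iterating over rows this way does not appear to be supported for Dask dataframes. How can I achieve the same functionality with a Dask dataframe? I also considered splitting the dataframe column by column and processing the columns in parallel. How can I parallelize this operation (the for loop) in Dask (Dask.distributed, since I am considering a cluster of machines)?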

And my non-working Dask code:

import numpy as np
import pandas as pd
import dask.dataframe as dd

def is_number(column, column_length):
    count = 0
    for row in column:
        if isinstance(row, (int, np.integer)) and \
                str(row) != 'True' and str(row) != 'False':
            count += 1
        elif isinstance(row, (float, np.floating)):
            count += 1
    if count >= column_length*0.51:
        column = pd.to_numeric(column, errors='coerce')
    return column

data = {'column1': [1.438161, -0.283780, 0.552564, 'dummy', 'string', 0.944862],
        'column2': ['lots', 'of', 'string', 'inthis', 'column', -0.657849],
        'column3': [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, 'dadada']}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=8)
df = df.repartition(partition_size="100MB")
print(df)
print(df.dtypes)
column_names = df.columns
for column in column_names:
    column_length = len(df[column])
    df[column] = is_number(df[column], column_length)
print(df.dtypes)

And the full traceback:

Traceback (most recent call last):
  File "/home/dodzilla-ai/.PyCharm2019.2/config/scratches/scratch_1.py", line 28, in <module>
    df[column] = is_number(df[column], column_length)
  File "/home/dodzilla-ai/.PyCharm2019.2/config/scratches/scratch_1.py", line 7, in is_number
    for row in column:
  File "/home/dodzilla-ai/Projects/project/venv/lib/python3.6/site-packages/dask/dataframe/core.py", line 2673, in __getitem__
    "Series getitem in only supported for other series objects "
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure

One approach: transpose the dataframe and process the data partition by partition...
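A minimal sketch of the partition-by-partition idea (without the transpose), under some assumptions: numeric values are counted by checking what pd.to_numeric can parse, which is a substitute for the isinstance checks in the question, and the 51% threshold is carried over from the question. The names count_numeric, pick_numeric_columns and convert_mostly_numeric are hypothetical. The Dask wrapper relies on to_delayed, dask.delayed, dask.compute, map_partitions and assign, and has not been checked against a specific Dask version.

```python
import pandas as pd


def count_numeric(pdf: pd.DataFrame) -> pd.Series:
    """Per column, count the values in one partition that parse as numbers."""
    return pd.Series(
        {col: pd.to_numeric(pdf[col], errors="coerce").notna().sum()
         for col in pdf.columns}
    )


def pick_numeric_columns(partition_counts, n_rows, threshold=0.51):
    """Add up the per-partition counts and keep the mostly numeric columns."""
    totals = pd.concat(list(partition_counts), axis=1).sum(axis=1)
    return [col for col, n in totals.items() if n >= n_rows * threshold]


def convert_mostly_numeric(ddf, threshold=0.51):
    """Apply the two helpers to a Dask dataframe, one partition at a time."""
    import dask  # imported lazily so the helpers above also work without Dask

    # One delayed task per partition; only the small count Series come back.
    tasks = [dask.delayed(count_numeric)(part) for part in ddf.to_delayed()]
    counts = dask.compute(*tasks)
    for col in pick_numeric_columns(counts, len(ddf), threshold):
        # Lazy conversion; values that cannot be parsed become NaN.
        # The float64 meta is an assumption about the converted dtype.
        converted = ddf[col].map_partitions(
            pd.to_numeric, errors="coerce", meta=(col, "float64")
        )
        ddf = ddf.assign(**{col: converted})
    return ddf


# Pandas-only check of the helpers, using two hand-made "partitions".
pdf = pd.DataFrame({
    "column1": [1.438161, -0.283780, 0.552564, "dummy", "string", 0.944862],
    "column2": ["lots", "of", "string", "inthis", "column", -0.657849],
    "column3": [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, "dadada"],
})
parts = [pdf.iloc[:3], pdf.iloc[3:]]
numeric_cols = pick_numeric_columns([count_numeric(p) for p in parts], len(pdf))
print(numeric_cols)  # prints ['column1', 'column3']
```

With a Dask dataframe, the call would be ddf = convert_mostly_numeric(ddf). Only the per-partition counts are collected on the client, and the conversion itself stays lazy until the result is computed or written out. On a Dask.distributed cluster the same delayed tasks would be scheduled across the workers once a Client has been created.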