Python: applying a filter function to chunks of data with dask

python dask

I wrote a function with pandas to downsample data, but some of my datasets do not fit in memory, so I would like to try it with dask. This is my current working code:

def sample_df(df, target_column="target", positive_percentage=35, index_col="index"):
    """
    Takes as input a data frame with imbalanced records, e.g. x% of positive cases, and returns
    a data frame with the specified percentage of positive cases, e.g. 10%.
    This is accomplished by downsampling the majority class.
    """
    # IDs of the positive rows; all of them are kept
    positive_cases = df[df[target_column] == 1][index_col]
    # number of negatives needed so that positives make up positive_percentage of the result
    number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
    negative_cases = list(set(df[index_col]) - set(positive_cases))

    try:
        negative_sample = random.sample(negative_cases, number_of_samples)
    except ValueError:
        # raised when more negatives are requested than the data contains
        print("The requested percentage is not valid for this dataset")
        return pd.DataFrame()

    final_sample = list(negative_sample) + list(positive_cases)
    #df = df.iloc[final_sample]
    df = df[df[index_col].isin(final_sample)]
    #df = df.reset_index(drop=True)

    print("New percentage is: ", df[target_column].sum() / len(df[target_column]) * 100)

    return df

The function can be used like this:

import pandas as pd
import random
from sklearn.datasets import make_classification

x,y = make_classification(100000,500)
df = pd.DataFrame(x)
df["target"] = y
df["id"] = 1 
df["id"] = df["id"].cumsum()
output_df = sample_df(df,target_column = "target",positive_percentage = 65,index_col="id")
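
As a rough check of the arithmetic behind `number_of_samples`, the sketch below evaluates the same expression on its own. The helper name `negatives_needed` and the count of 50,000 positives are illustrative assumptions (`make_classification` produces roughly, not exactly, balanced classes), so this is only an approximation of the call above.

```python
def negatives_needed(n_positive, positive_percentage):
    """Number of negative rows to keep so positives form the requested share."""
    # Same expression as number_of_samples in sample_df
    return int(((100 / positive_percentage) - 1) * n_positive)


# Hypothetical counts: 50,000 positives, as an exactly balanced 100,000-row set would have
n_pos = 50_000
n_neg = negatives_needed(n_pos, 65)
share = 100 * n_pos / (n_pos + n_neg)
print(n_neg, round(share, 2))  # prints: 26923 65.0
```

If the value returned here is larger than the number of negatives available, `random.sample` raises `ValueError`, which is the case the `except` branch in `sample_df` handles.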

This works well with pandas for small datasets, but when I try it on a dataset that does not fit in memory, with either pandas or dask, the computer crashes.

How can I apply this function to each chunk of data that dask reads, and then merge all the chunks?

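This approach works with pure pandas and does not need dask, provided the subsampled dataset is small enough to fit in memory. You can split the df into chunks, apply the filter to each chunk, and then append each chunk to an empty data frame. Operating on a chunk works just like operating on the df. I will start from a file, since you said you cannot load the data into memory. So I changed the df arg of the function to infile and added a chunk_size parameter with a default of 10000, so each chunk is processed as 10000 rows: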

def sample_df(infile, target_column="target", positive_percentage=35, index_col="index", chunk_size=10000):
    """
    Takes as input the path to a CSV file with imbalanced records, e.g. x% of positive cases, and returns
    a data frame with the specified percentage of positive cases, e.g. 10%.
    This is accomplished by downsampling the majority class within each chunk.
    """
    sampled_chunks = []
    for chunk in pd.read_csv(infile, chunksize=chunk_size):
        positive_cases = chunk[chunk[target_column] == 1][index_col]
        number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
        negative_cases = list(set(chunk[index_col]) - set(positive_cases))

        try:
            negative_sample = random.sample(negative_cases, number_of_samples)
        except ValueError:
            print("The requested percentage is not valid for this dataset")
            return pd.DataFrame()

        final_sample = list(negative_sample) + list(positive_cases)
        #subdf = chunk.iloc[final_sample]
        subdf = chunk[chunk[index_col].isin(final_sample)]
        #subdf = chunk.reset_index(drop=True)
        # collect each subsampled chunk (DataFrame.append was removed in pandas 2.0)
        sampled_chunks.append(subdf)

    # combine the subsampled chunks into a single data frame
    df = pd.concat(sampled_chunks)

    print("New percentage is: ", df[target_column].sum() / len(df[target_column]) * 100)

    return df

Doing it this way subsamples each chunk separately rather than the df as a whole.
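
One consequence of sampling per chunk is that a single chunk with too few negatives makes the whole call return an empty data frame. A possible alternative, which is not part of the answer above, is a two-pass variant: count the classes over all chunks first, then keep each negative with a fixed probability. The sketch below assumes a hypothetical helper `sample_csv_two_pass` that takes a `make_reader` callable returning a fresh chunk iterator, and it uses a small in-memory CSV as stand-in data. The resulting percentage is only approximately the requested one, because negatives are kept at random.

```python
import io

import numpy as np
import pandas as pd


def sample_csv_two_pass(make_reader, target_column="target", positive_percentage=35, seed=0):
    """Hypothetical two-pass variant: count classes first, then sample negatives."""
    # Pass 1: count positives and negatives over all chunks
    n_pos = n_neg = 0
    for chunk in make_reader():
        is_pos = chunk[target_column] == 1
        n_pos += int(is_pos.sum())
        n_neg += int((~is_pos).sum())

    n_wanted = int(((100 / positive_percentage) - 1) * n_pos)
    if n_wanted > n_neg:
        raise ValueError("The requested percentage is not valid for this dataset")
    frac = n_wanted / n_neg if n_neg else 0.0

    # Pass 2: keep every positive and each negative with probability frac
    rng = np.random.default_rng(seed)
    parts = []
    for chunk in make_reader():
        is_pos = chunk[target_column] == 1
        keep = is_pos | (rng.random(len(chunk)) < frac)
        parts.append(chunk[keep])
    return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()


# Stand-in data: 1000 rows with roughly 30% positives, serialized as CSV text
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "id": np.arange(1, 1001),
    "target": (demo_rng.random(1000) < 0.3).astype(int),
})
csv_text = demo.to_csv(index=False)

out = sample_csv_two_pass(
    lambda: pd.read_csv(io.StringIO(csv_text), chunksize=100),
    positive_percentage=50,
)
print("New percentage is: ", out["target"].mean() * 100)
```

For a file on disk, `make_reader` could be something like `lambda: pd.read_csv(path, chunksize=10000)`, at the cost of reading the file twice.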

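Is it possible to do this with dask, and have dask handle joining the files automatically? I would like to have the option of moving the task to a cluster without changing the code.

The code above does not fail because of your function; it fails when the data frame you create is too large for memory, which happens before your function runs. I did not see this beforehand, but it makes your function irrelevant to the problem, so I do not really know why you included it.

No, that is not the problem. The datasets I have are stored on disk; they are not created each time. The purpose of the example is to provide a minimal working example. The question stays the same: how to do this with dask.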