Python: Creating a subset Dask DataFrame based on another Dask DataFrame

python pandas dataframe dask

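I have two Dask DataFrames, df1 with 5,000 rows and df2 with 100,000 rows. Both have start_time and end_time columns. I am trying to find the rows of df1 for which at least one row of df2 has a start_time to end_time interval that lies within (or equals) the df1 row's start_time to end_time interval.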
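To make the condition concrete, here is a minimal pure-Python sketch of the interval test. The helper name `encloses` and the integer toy intervals are assumptions for illustration only; the real data would hold timestamps in DataFrame columns.

```python
def encloses(outer, inner):
    """Return True if interval `inner` lies within interval `outer` (bounds inclusive)."""
    outer_start, outer_end = outer
    inner_start, inner_end = inner
    return inner_start >= outer_start and inner_end <= outer_end

# Hypothetical toy intervals standing in for the rows of df1 and df2.
df1_intervals = [(0, 5), (10, 15)]
df2_intervals = [(1, 4), (12, 18)]

# Keep every df1 interval that encloses at least one df2 interval.
matches = [o for o in df1_intervals
           if any(encloses(o, i) for i in df2_intervals)]
```

Only (0, 5) is kept here: it contains (1, 4), while (12, 18) ends after (10, 15) does.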

import dask.dataframe as dd

For testing purposes, I read the source (Pandas) DataFrames (df1 and df2) and then converted them to Dask DataFrames:

dd1 = dd.from_pandas(df1, npartitions=2)
dd2 = dd.from_pandas(df2, npartitions=2)

In your program (I suppose) you will create them directly from their respective input files.

Then define the following function:

def enclosesAny(row, other):
    return (other.start_time.ge(row.start_time) &
        other.end_time.le(row.end_time)).any()
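As a quick sanity check, the function can be tried on plain pandas data before involving Dask. This is only a sketch: the frames `df1_small` and `df2_small` and their integer times are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def enclosesAny(row, other):
    # True if any interval in `other` lies within the interval of `row`.
    return (other.start_time.ge(row.start_time) &
            other.end_time.le(row.end_time)).any()

# Hypothetical toy data; real columns would typically hold timestamps.
df1_small = pd.DataFrame({"start_time": [0, 10], "end_time": [5, 15]})
df2_small = pd.DataFrame({"start_time": [1, 12], "end_time": [4, 18]})

# Apply the function to each row of df1_small; `other` is passed through as a keyword.
flags = df1_small.apply(enclosesAny, axis=1, other=df2_small)
```

The first row is flagged because (1, 4) lies inside (0, 5); the second is not, because (12, 18) ends after 15.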

The actual computation takes two steps:

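Compute the encl column, i.e. the result of applying the function above to each row. Because map_partitions passes a whole partition (a pandas DataFrame) to its function, enclosesAny is applied row by row inside each partition, and other is the in-memory pandas df2 so that .any() returns a plain boolean:

dd1['encl'] = dd1.map_partitions(
    lambda part: part.apply(enclosesAny, axis=1, other=df2), meta=('encl', 'bool'))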

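Generate the actual result. It contains the rows of dd1 for which encl is True. To keep the result free of extra columns, I drop the encl column:

result = dd1[dd1.encl].drop('encl', axis=1).compute()

Welcome to Stackoverflow, please provide a sample of your DataFrames and your expected output.

Thanks. I have edited the question with sample input and output :-)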