Julia DataFrames: an efficient way to expand a data frame


I have a data frame with exposure episodes per case:

using DataFrames
using Dates
df = DataFrame(id = [1,1,2,3], startdate = [Date(2018,3,1),Date(2019,4,2),Date(2018,6,4),Date(2018,5,1)], enddate = [Date(2019,4,4),Date(2019,8,5),Date(2019,3,1),Date(2019,4,15)])
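I would like to expand each episode into its constituent days, eliminating the duplicate days per case that result from overlapping episodes (case 1 in the example data frame).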

I strongly suspect there is a more efficient way than the brute-force approach I came up with; any help would be appreciated.


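Implementation note: in the actual implementation there are several hundred thousand cases, each with relatively few episodes (median = 1, 75th percentile = 3, maximum 20), but the exposure spans 20 years or more, which results in a very large dataset (hundreds of millions of records). To fit into available memory, I partitioned the dataset by id and loop over the partitions in parallel using the Threads.@threads macro. The main purpose of breaking the data into days is not just to eliminate overlap but to join the data with other exposure data that is available per day.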
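The partition-and-thread setup described in the note could look roughly like the sketch below. The original partitioning code is not shown, so everything here is an assumption for illustration: the helper name expand_days, the choice of mod(id, nparts) as the partition key, and nparts = 2. Overlap handling is deliberately left out; the sketch only shows how each thread can work on its own partition.

```julia
using DataFrames
using Dates

# Hypothetical per-partition worker: one output row per day of every episode.
# Overlapping episodes are NOT de-duplicated here.
function expand_days(part)
    ids = Int[]
    days = Date[]
    for row in eachrow(part)
        for d in row.startdate:Day(1):row.enddate
            push!(ids, row.id)
            push!(days, d)
        end
    end
    return DataFrame(id = ids, daydate = days)
end

df = DataFrame(id = [1, 1, 2, 3],
               startdate = [Date(2018,3,1), Date(2019,4,2), Date(2018,6,4), Date(2018,5,1)],
               enddate = [Date(2019,4,4), Date(2019,8,5), Date(2019,3,1), Date(2019,4,15)])

nparts = 2  # assumed number of partitions
# Partition by id so that all episodes of a case end up in the same partition.
parts = [df[mod.(df.id, nparts) .== k, :] for k in 0:nparts-1]

results = Vector{DataFrame}(undef, nparts)
Threads.@threads for i in 1:nparts
    results[i] = expand_days(parts[i])  # each iteration writes only to its own slot
end

expanded = reduce(vcat, results)
```

Because every iteration writes to a distinct element of results, no locking is needed. Julia has to be started with more than one thread (for example julia -t 4) for the loop to actually run in parallel.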

Here is something that should be efficient:

dfs = sort(df, [:startdate, order(:enddate, rev=true)])
gdf = groupby(dfs, :id, sort=true)

function process(startdate, enddate)
    start = startdate[1]
    stop = enddate[1]
    res_daydate = collect(start:Day(1):stop)
    res_startdate = fill(start, length(res_daydate))
    res_enddate = fill(stop, length(res_daydate))

    for i in 2:length(startdate)
        if startdate[i] > res_daydate[end]
            start = startdate[i]
            stop = enddate[i]
        elseif enddate[i] > res_daydate[end]
            start = res_daydate[end] + Day(1)
            stop = enddate[i]
        end
        new_daydate = start:Day(1):stop
        append!(res_daydate, new_daydate)
        append!(res_startdate, fill(startdate[i], length(new_daydate)))
        append!(res_enddate, fill(stop, length(new_daydate)))
    end

    return (startdate=res_startdate, enddate=res_enddate, daydate=res_daydate)
end

combine(gdf, [:startdate, :enddate] => process => AsTable)

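(Please check it against your implementation on larger data to confirm that it is correct, though, as I wrote it quickly to show you how to do a performant implementation with DataFrames.jl.)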
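One simple way to carry out such a check is to confirm that no case has the same day twice in the expanded result. A minimal sketch, assuming the result has id and daydate columns like the output of the combine call; the helper name has_duplicate_days and the two small data frames are made up for illustration.

```julia
using DataFrames
using Dates

# Hypothetical helper: true if any (id, daydate) pair occurs more than once.
has_duplicate_days(r) = nrow(unique(r, [:id, :daydate])) < nrow(r)

ok  = DataFrame(id = [1, 1, 2],
                daydate = [Date(2018,3,1), Date(2018,3,2), Date(2018,3,1)])
bad = DataFrame(id = [1, 1, 2],
                daydate = [Date(2018,3,1), Date(2018,3,1), Date(2018,3,1)])

has_duplicate_days(ok)   # false: the same day under different ids is fine
has_duplicate_days(bad)  # true: id 1 has 2018-03-01 twice
```

Running the same helper on the output of the brute-force implementation and on the output of the combine call gives a quick first comparison before looking at the rows themselves.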

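Here is a more complete solution that takes some essential details into account. Each episode is associated with other attributes; for example, I use locationid (the place where the exposure occurred), and I need to indicate whether there is a gap between subsequent episodes. The initial solution also did not handle the special case where one episode is entirely contained within another; such an episode should not be expanded.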

using Dates
using DataFrames

function process(startdate, enddate, locationid)
    start = startdate[1]
    stop = enddate[1]
    location = locationid[1]
    res_daydate = collect(start:Day(1):stop)
    res_startdate = fill(start, length(res_daydate))
    res_enddate = fill(stop, length(res_daydate))
    res_location = fill(location, length(res_daydate))
    gap = 0
    res_gap = fill(0, length(res_daydate))
    for i in 2:length(startdate)
        if startdate[i] > res_daydate[end]
            start = startdate[i]
        elseif enddate[i] > res_daydate[end]
            start = res_daydate[end] + Day(1)
        else
            continue # this episode is contained within the previous episode
        end
        # flip the gap flag whenever the new run does not start on the day after the previous run ended
        if start - res_daydate[end] > Day(1)
            gap = gap==0 ? 1 : 0
        end
        stop = enddate[i]
        location = locationid[i]
        new_daydate = start:Day(1):stop
        append!(res_daydate, new_daydate)
        append!(res_startdate, fill(startdate[i], length(new_daydate)))
        append!(res_enddate, fill(stop, length(new_daydate)))
        append!(res_location, fill(location, length(new_daydate)))
        append!(res_gap, fill(gap, length(new_daydate)))
    end

    return (daydate=res_daydate, startdate=res_startdate, enddate=res_enddate, locationid=res_location, gap = res_gap)
end

function eliminateoverlap()
    df = DataFrame(id = [1,1,2,3,3,4,4], startdate = [Date(2018,3,1),Date(2019,4,2),Date(2018,6,4),Date(2018,5,1), Date(2019,5,1), Date(2012,1,1), Date(2012,2,2)], 
                   enddate = [Date(2019,4,4),Date(2019,8,5),Date(2019,3,1),Date(2019,4,15),Date(2019,6,15),Date(2012,6,30), Date(2012,2,10)], locationid=[10,11,21,30,30,40,41])
    dfs = sort(df, [:startdate, order(:enddate, rev=true)])
    gdf = groupby(dfs, :id, sort=true)
    r = combine(gdf, [:startdate, :enddate, :locationid] => process => AsTable)
    df = combine(groupby(r, [:id,:gap,:locationid]), :daydate => minimum => :StartDate, :daydate => maximum => :EndDate)
    return df
end

df = eliminateoverlap()

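Your specification is incomplete. To help you I need to know, for the case where two or more episodes overlap: 1) which episode's startdate and enddate do you want to keep? 2) for the episode that is kept, which values in start and end should be true? (Right now only one row is kept, so what remains in these columns is arbitrary and depends on the order in which the episodes appear in the original data frame.)

Sorry, I omitted a sort step before the unique!(.. step:
sort!(s,[:id,:DayDate,:startdate,order(:enddate,rev=true)])
So the dates selected are those of the episode that starts first and, in case of a tie, the one that ends last. The start and end flags can be dropped; they are really only needed to indicate gaps in the day dates, where consecutive expanded episodes are not adjacent to each other.

I used two datasets (A: 127K episodes => 160M days, B: 280K episodes => 800M days). A took 47 s versus 81 s for the original brute-force algorithm; B took 4 min 23 s versus 4 min 24 s. The actual implementation showed bigger differences: 52 s versus 195 s for A and 320 s versus 1446 s for B. There is a big difference in memory allocations: 973M allocations totalling 52 GB versus 5G allocations totalling 108 GB for A, and 5G allocations totalling 268 GB versus 26G allocations totalling 521 GB for B. I am using an AMD Ryzen 3800X with 64 GB RAM running Win10. It is amazing how quickly Julia processes these huge datasets. Thank you so much for your help.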
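For readers who want a baseline to compare against, the sort-then-unique approach described in the comments can be sketched as follows. The original brute-force code is not shown here, so this is a reconstruction under assumptions: the names s and DayDate are taken from the sort! call quoted above, the start and end flags are left out, and the expansion loop is only one possible way to build s.

```julia
using DataFrames
using Dates

df = DataFrame(id = [1, 1, 2, 3],
               startdate = [Date(2018,3,1), Date(2019,4,2), Date(2018,6,4), Date(2018,5,1)],
               enddate = [Date(2019,4,4), Date(2019,8,5), Date(2019,3,1), Date(2019,4,15)])

# Expand every episode into one row per day; overlapping episodes yield duplicate days.
s = DataFrame(id = Int[], DayDate = Date[], startdate = Date[], enddate = Date[])
for row in eachrow(df), d in row.startdate:Day(1):row.enddate
    push!(s, (id = row.id, DayDate = d, startdate = row.startdate, enddate = row.enddate))
end

# Per id and day, put the episode that starts first (ties: ends last) in front ...
sort!(s, [:id, :DayDate, :startdate, order(:enddate, rev=true)])
# ... and keep only the first row for each (id, DayDate) pair.
unique!(s, [:id, :DayDate])
```

This materialises every duplicate day before discarding it, which is consistent with the much higher allocation counts reported for the brute-force runs.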