Scala Spark: processing items with a duration for statistical analysis


scala sorting apache-spark


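I have a Spark DataFrame whose items each have a duration: an item becomes active at its "start" and ends at its "stop".

Now I want to compute, every time a change happens, the mean (or later something more complex) over all IDs that are active at that moment. The problem is that the original dataset has about 30 million rows, so I really do need to use Spark (2.0).

A brute-force approach like this one (written here in pandas-like pseudocode) is very inefficient: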

Result = []
for type, subDf in df.groupby("type"):  # for every type

    # get all times where changes happen
    changeTimes = union(subDf["start"], subDf["stop"]).sort()

    # loop through all times where an item starts or stops
    for t in changeTimes:

        # keep only the items that are active at time t
        subDfTime = subDf[(subDf.start < t) & (subDf.stop > t)]

        # calculate their mean
        result = subDfTime.value.mean()

        # record the time, the type and the mean of the values
        Result.append([t, type, result])

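1: The first idea is based on searchsorted, which should be an O(log N) operation, followed by indexing the original DataFrame through an index, which should also be O(log N). So I would guess this is faster (and in pandas it indeed is).

2: The second idea is to sort the start and stop times and iterate over them, each time either taking in a new item (if it is a start time) or dropping an item (if it is a stop time). (I have already asked how to implement this in Spark and got good feedback.)

This idea works like this: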
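To make the searchsorted idea concrete, here is a minimal plain-Scala sketch (no Spark) for a single type. It is not the pandas implementation referred to above. The names `Item`, `Index`, `countBelow` and `meanAt` are hypothetical, and the sketch assumes that `start < stop` holds for every item and that only a mean is needed. It sorts the start and stop times once, keeps prefix sums of the values, and then answers each query with two binary searches, using the same strict condition `start < t && stop > t` as the brute-force loop.

```scala
object SearchSortedMean {
  // Hypothetical row type; the field names follow the pseudocode columns.
  final case class Item(start: Long, stop: Long, value: Double)

  /** Number of elements of the ascending array `xs` that are < key
    * (strict = true) or <= key (strict = false), found by binary search. */
  private def countBelow(xs: Array[Long], key: Long, strict: Boolean): Int = {
    var lo = 0
    var hi = xs.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      val below = if (strict) xs(mid) < key else xs(mid) <= key
      if (below) lo = mid + 1 else hi = mid
    }
    lo
  }

  /** Built once per type; assumes start < stop for every item. */
  final class Index(items: Seq[Item]) {
    private val byStart = items.sortBy(_.start)
    private val byStop  = items.sortBy(_.stop)
    private val starts  = byStart.map(_.start).toArray
    private val stops   = byStop.map(_.stop).toArray
    // Prefix sums: startSums(k) is the total value of the k earliest starts.
    private val startSums = byStart.scanLeft(0.0)(_ + _.value).toArray
    private val stopSums  = byStop.scanLeft(0.0)(_ + _.value).toArray

    /** Mean value of the items with start < t && stop > t, or None if there are none. */
    def meanAt(t: Long): Option[Double] = {
      val started = countBelow(starts, t, strict = true)  // items with start < t
      val stopped = countBelow(stops, t, strict = false)  // items with stop <= t
      val active  = started - stopped
      if (active > 0) Some((startSums(started) - stopSums(stopped)) / active)
      else None
    }
  }
}
```

The subtraction trick only works for aggregates that can be undone (count, sum, mean); anything more complex would still need the actual set of active rows.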

Result = []
for type, subDf in df.groupby("type"):  # for every type

    # get all times where changes happen and add a direction:
    # true for an id getting in, false for an id popping out
    start_times = subDf[["id", "start"]].rename("start", "time")
    start_times["direction"] = true
    stop_times = subDf[["id", "stop"]].rename("stop", "time")
    stop_times["direction"] = false

    # sort the united list by time
    changeTimes = union(start_times, stop_times).sort_values("time")

    # set of ids that are active at the current step
    valid_ids = Set([])

    for id, time, direction in changeTimes:
        # if direction is true we are looking at a start time,
        # so we put the id into the set
        # (== true is unnecessarily explicit, I know, but it is explicit)
        if direction == true: valid_ids.add(id)
        # if not, it is a stop time, so we remove the id from the set
        else: valid_ids.remove(id)

        # select all the active items
        subDfTime = subDf.index(valid_ids)

        # calculate their mean
        result = subDfTime.value.mean()

        # record the time, the type and the mean of the values
        Result.append([time, type, result])
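For the mean specifically, the set of active ids is not strictly needed: a running count and a running sum are enough. The following plain-Scala sketch (no Spark, one type at a time) illustrates that variant. The names `Item` and `activeMeans` are hypothetical, and the sketch assumes an item counts as active from its start time up to, but not including, its stop time, which differs slightly from the strict `<` and `>` in the brute-force loop.

```scala
object SweepLine {
  // Hypothetical row type; the field names follow the pseudocode columns.
  final case class Item(id: Long, start: Long, stop: Long, value: Double)

  /** For every distinct change time, the mean value of the items active right
    * after that time (start <= t < stop), or None when nothing is active. */
  def activeMeans(items: Seq[Item]): Seq[(Long, Option[Double])] = {
    // Every item yields a start event (+1, +value) and a stop event (-1, -value).
    val events = items.flatMap { i =>
      Seq((i.start, 1, i.value), (i.stop, -1, -i.value))
    }
    // Merge events that share a timestamp and order them by time.
    val grouped = events.groupBy(_._1).toSeq.sortBy(_._1)
    // Running (count, sum) after applying each group of events.
    val running = grouped.scanLeft((0, 0.0)) { case ((count, sum), (_, evs)) =>
      (count + evs.map(_._2).sum, sum + evs.map(_._3).sum)
    }.tail
    grouped.map(_._1).zip(running).map { case (t, (count, sum)) =>
      (t, if (count > 0) Some(sum / count) else None)
    }
  }
}
```

Because values are added and later subtracted again, the running sum can accumulate floating-point rounding error over long sequences; a more complex statistic would need the explicit id set as in the pseudocode.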

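So my question is: which of these techniques is suitable for Scala/Spark? Is there a "searchsorted" method? Is there a more efficient way to filter a Spark DataFrame by index than the brute-force ".filter()" (perhaps using some kind of hash index)? A single lookup like the following takes a very long time: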

val fset = List(18850L, 30929L, 46538L, 50405L, 57596L, 59917L)
subDf.filter( $"id".isin(fset: _*) ).show()

and

subProp.filter( $"startTime" < lit(30642238068L) && $"stopTime" > lit(30642238068L) ).show()


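As for parallelism, it would be perfectly fine if each "groupBy('type')" ran on a single worker; that part does not need to be parallelized.

I would also appreciate other hints. This looks like a problem that has probably been solved before, but I clearly lack the right terms, names and context for a successful Google search.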
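One way this per-type layout could look in Spark is a window partitioned by type, which keeps all rows of one type together while different types are processed independently. The sketch below is an assumption-laden illustration, not a tested solution for the 30-million-row dataset. It assumes the columns are named `type`, `start`, `stop` and `value` as in the pseudocode, that `value` is numeric, and that an item counts as active for `start <= t < stop`. The names `ActiveMeanSpark` and `activeMeans` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

object ActiveMeanSpark {
  /** One row per (type, change time) with the number of active items and
    * their mean value right after that time; mean is null when none is active. */
  def activeMeans(df: DataFrame): DataFrame = {
    // A start adds one item and its value, a stop removes them again.
    val starts = df.select(col("type"), col("start").as("time"),
      lit(1L).as("dCount"), col("value").as("dSum"))
    val stops = df.select(col("type"), col("stop").as("time"),
      lit(-1L).as("dCount"), (-col("value")).as("dSum"))

    // Collapse events that happen at the same time within a type.
    val events = starts.union(stops)
      .groupBy("type", "time")
      .agg(sum("dCount").as("dCount"), sum("dSum").as("dSum"))

    // Running totals per type in time order (the default frame of an ordered
    // window runs from the start of the partition to the current row).
    val w = Window.partitionBy("type").orderBy("time")

    events
      .withColumn("activeCount", sum("dCount").over(w))
      .withColumn("activeSum", sum("dSum").over(w))
      .withColumn("mean",
        when(col("activeCount") > 0, col("activeSum") / col("activeCount")))
      .select("type", "time", "activeCount", "mean")
  }
}
```

This covers only count, sum and mean. A statistic that needs the full set of active rows at each change time would require a different approach, for example grouping by type and running custom per-group code.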

Please let me know if anything is unclear and I will improve the question. I know that what I am trying to solve is not a simple transformation.

I think your problem statement is clear. I understand the two approaches you describe (but I am not sure what the purpose of the statement "if direction == true: valid_ids + id else: valid_ids + id" is). I would also like to know how to translate your implementation to Spark! :-)

@Alexandre Thanks for asking and for finding the bug. I have added some comments to make it clearer.
