Apache Spark: referencing the next entry of an RDD inside a map function


I have a stream to process.

For example (for simplicity, assume there is only one id):

Suppose TIMEOUT = 5. Because no further event occurs for more than 5 seconds after D, I want to map this into a JavaPairDStream with two key:value pairs:

id1_1:
A             1                 
B             2                 
C             4                 
D             7                 

However, in the anonymous PairFunction object that I pass to the mapToPair() method:

incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, RequestData> call(String s) {
        // ...
    }
});
I cannot reference the data of the next entry. In other words, while I am processing the entry with event D, I cannot look at the data at E.

If this were not Spark, I could simply create an array timeDifferences, store the differences between each two adjacent timestamps, and split the array into parts wherever I see a time difference in timeDifferences greater than TIMEOUT (although there is actually no need to explicitly create the array).


How can I do this in Spark?

I'm still trying to understand your question, but based on what you wrote, I think you can do this:

val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x => (x._2, x._1))
val B = A.map(x => (x._1 - 1, x._2))
val C = A.leftOuterJoin(B).map(x => (x._2._1, x._2._1._3 - (x._2._2 match {
  case Some(a) => a._3
  case _ => 0
})))
val group1 = C.filter(x => (x._2 <= 5))
val group2 = C.filter(x => (x._2 > 5))

So the idea is to zip with an index to create val A (which assigns a sequential Long number to each entry of the RDD), and to create val B as a copy of the RDD keyed by the index of the following entry (by subtracting 1 from each index); a join then lets you compute the gap between consecutive entries, after which you use filters. This approach stays within RDDs. A simpler way would be to collect the entries to the master and then use map or zip, but I guess that would be Scala rather than Spark.
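The "collect to the master and zip" alternative mentioned above might look like this in plain Java (a minimal local sketch; the class and method names are made up, and the values follow the answer's example data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FollowingGaps {
    // Local equivalent of joining an RDD with a copy shifted by one index:
    // pair each value with its successor and record the difference.
    static List<Double> gaps(List<Double> values) {
        List<Double> out = new ArrayList<>();
        for (int i = 0; i + 1 < values.size(); i++) {
            out.add(values.get(i + 1) - values.get(i)); // gap to the next entry
        }
        return out;
    }

    public static void main(String[] args) {
        // Values 1.0, 2.0, 15.0 as in the answer's example
        System.out.println(gaps(Arrays.asList(1.0, 2.0, 15.0))); // prints [1.0, 13.0]
    }
}
```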

I believe this is exactly what you need:

def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
    val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
    val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})

    // joining the two to attach a "followingGap" to each event
    val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
       case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
       case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
    })

    // collecting (to driver memory!) cutoff points - timestamp of events that are *last* in their window
    // if this collection is very large, another join might be needed
    val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()

    // going back to the original input, grouping by each event's nearest cutoffPoint (i.e. the beginning of this event's window)
    input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)).values
}

case class Event(timestamp: Long, data: String)

case class ExtendedEvent(event: Event, followingGap: Long)
The first part is based on the answer above - joining the input with itself at an offset of 1 to compute a "following gap" for each record. We then collect the "break points" or "cutoff points" between windows, and use these points in another transformation over the input to group it by window.
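The final grouping step (each event keyed by the greatest cutoff point strictly below its timestamp, defaulting to 0) can be illustrated locally (a plain-Java sketch; the class and method names are made up):

```java
import java.util.Arrays;
import java.util.List;

public class CutoffGrouping {
    // Local equivalent of:
    //   cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)
    // i.e. the greatest cutoff strictly below the timestamp, or 0 if none.
    static long windowKey(List<Long> cutoffPoints, long timestamp) {
        long best = 0L;
        for (long c : cutoffPoints) {
            if (c < timestamp && c > best) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        List<Long> cutoffs = Arrays.asList(7L); // D (timestamp 7) is last in its window
        System.out.println(windowKey(cutoffs, 4L));  // prints 0 -> first window
        System.out.println(windowKey(cutoffs, 13L)); // prints 7 -> next window
    }
}
```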


Note: depending on the characteristics of the input, there may be more efficient ways to perform these transformations; for example, if you have many "sessions", this code might run slowly or run out of memory.

Thanks, but the tuples represent user actions, and I'm essentially creating a new, distinct session for a user whenever the user has been inactive for a certain amount of time (which I call TIMEOUT). So in my example, if TIMEOUT = 2, I would want (A, B, C, D, E, F) to be mapped into ((A, B), (C), (D), (E, F)).