Apache Spark: referencing the next entry of an RDD inside a map function


I have a stream to process.

For example (for simplicity, assume there is only one id):

Suppose TIMEOUT = 5. Because no further event occurs for more than 5 seconds after D, I want to map this into a JavaPairDStream with two key:value pairs:

id1_1:
A             1                 
B             2                 
C             4                 
D             7                 

However, in the anonymous PairFunction object that I pass to the mapToPair() method:

incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, RequestData> call(String s) {
        // ...
    }
});
I cannot reference the data of the next entry. In other words, while I am processing the entry with event D, I cannot look at the data at E.

If this were not Spark, I could simply create an array timeDifferences, store the differences between each two adjacent timestamps, and split the array into parts wherever I see a time difference in timeDifferences greater than TIMEOUT (although there is actually no need to explicitly create the array).


How can I do this in Spark?

I'm still trying to understand your question, but based on what you wrote, I think you can do this:

val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x => (x._2, x._1))
val B = A.map(x => (x._1 - 1, x._2))
val C = A.leftOuterJoin(B).map(x => (x._2._1, x._2._1._3 - (x._2._2 match {
  case Some(a) => a._3
  case _ => 0
})))
val group1 = C.filter(x => (x._2 <= 5))
val group2 = C.filter(x => (x._2 > 5))

So the idea is to zip with an index to create val A (which assigns a sequential Long number to each entry of the RDD), and to create val B as a copy of the RDD keyed by the index of the following entry (by subtracting 1 from each index); a join then lets you compute the gap between consecutive entries, after which you use filters. This approach stays within RDDs. A simpler way would be to collect the entries to the master and then use map or zip, but I guess that would be Scala rather than Spark.
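The "collect to the master and zip" alternative mentioned above might look like this in plain Java (a minimal local sketch; the class and method names are made up, and the values follow the answer's example data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FollowingGaps {
    // Local equivalent of joining an RDD with a copy shifted by one index:
    // pair each value with its successor and record the difference.
    static List<Double> gaps(List<Double> values) {
        List<Double> out = new ArrayList<>();
        for (int i = 0; i + 1 < values.size(); i++) {
            out.add(values.get(i + 1) - values.get(i)); // gap to the next entry
        }
        return out;
    }

    public static void main(String[] args) {
        // Values 1.0, 2.0, 15.0 as in the answer's example
        System.out.println(gaps(Arrays.asList(1.0, 2.0, 15.0))); // prints [1.0, 13.0]
    }
}
```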

I believe this is exactly what you need:

def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
    val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
    val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})

    // joining the two to attach a "followingGap" to each event
    val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
       case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
       case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
    })

    // collecting (to driver memory!) cutoff points - timestamp of events that are *last* in their window
    // if this collection is very large, another join might be needed
    val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()

    // going back to the original input, grouping by each event's nearest cutoffPoint (i.e. the beginning of this event's window)
    input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)).values
}

case class Event(timestamp: Long, data: String)

case class ExtendedEvent(event: Event, followingGap: Long)
The first part is based on the answer above - joining the input with itself at an offset of 1 to compute a "following gap" for each record. We then collect the "break points" or "cutoff points" between windows, and use these points in another transformation over the input to group it by window.
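The final grouping step (each event keyed by the greatest cutoff point strictly below its timestamp, defaulting to 0) can be illustrated locally (a plain-Java sketch; the class and method names are made up):

```java
import java.util.Arrays;
import java.util.List;

public class CutoffGrouping {
    // Local equivalent of:
    //   cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)
    // i.e. the greatest cutoff strictly below the timestamp, or 0 if none.
    static long windowKey(List<Long> cutoffPoints, long timestamp) {
        long best = 0L;
        for (long c : cutoffPoints) {
            if (c < timestamp && c > best) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        List<Long> cutoffs = Arrays.asList(7L); // D (timestamp 7) is last in its window
        System.out.println(windowKey(cutoffs, 4L));  // prints 0 -> first window
        System.out.println(windowKey(cutoffs, 13L)); // prints 7 -> next window
    }
}
```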


Note: depending on the characteristics of the input, there may be more efficient ways to perform these transformations; for example, if you have many "sessions", this code might run slowly or run out of memory.

Thanks, but the tuples represent user actions, and I'm essentially creating a new, distinct session for a user whenever the user has been inactive for a certain amount of time (which I call TIMEOUT). So in my example, if TIMEOUT = 2, I would want (A, B, C, D, E, F) to be mapped into ((A, B), (C), (D), (E, F)).