Apache Spark: output operation after mapWithState

apache-spark, spark-streaming

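We have a Spark Streaming application that consumes events from Kafka. We want to aggregate the events that arrive within a period of time by the traceid carried in each event, create one aggregated event per traceid, and write that aggregated event to the database.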

Our events look like this:

traceid: 123
{
  info: abc;
}

traceid: 123
{
  info:bcd;
}

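What we want is to build one aggregated event over a period of time, say 2 minutes, and write that aggregated event to the database instead of the individual events: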

traceid: 123 { info:abc,bcd }
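
Independent of Spark, the target shape of the aggregation can be sketched in plain Scala. This is only an illustration: the `Event` case class and the `aggregate` helper are hypothetical names that do not appear in the application, and the comma-joined `info` string simply mirrors the example above.

```scala
object AggregateSketch {
  // Hypothetical, simplified event: one traceid plus a single info value.
  final case class Event(traceId: String, info: String)

  // Groups events by traceid and joins their info values in arrival order.
  def aggregate(events: Seq[Event]): Map[String, String] =
    events
      .groupBy(_.traceId)
      .map { case (traceId, evs) => traceId -> evs.map(_.info).mkString(",") }
}
```

For the two sample events, `AggregateSketch.aggregate` yields a single entry for traceid 123 whose value is `abc,bcd`.
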
We used mapWithState and came up with this code:

    import org.apache.spark.streaming.{Minutes, State, StateSpec, Time}
    import scala.collection.immutable

    // Called for every incoming record of a key, and once more (with url = None) when the key times out.
    def trackStateFunc(
        batchTime: Time,
        id: String,
        url: Option[MetricTypes.EnrichedKeyType],
        state: State[SessionData]): Option[(String, String, Long, immutable.Map[String, String])] = {
      val enrichedId = id
      var accountId: String = null
      var reducedText: String = null
      var commonIDS: String = null
      var deviceId: String = null
      var ets: Long = 0
      var eventId: String = null
      if (url.isDefined) {
        accountId = url.get._1.asInstanceOf[String]
        reducedText = url.get._2.asInstanceOf[String]
        commonIDS = url.get._3.asInstanceOf[String]
        deviceId = url.get._4.asInstanceOf[String]
        ets = url.get._5.toString.toLong
        eventId = url.get._6.asInstanceOf[String]
        val attributeMap = Map(
          eventId -> reducedText,
          "common_ids" -> commonIDS,
          "common_enriched_physicalDeviceId" -> deviceId
        )
        if (state.exists) {
          // Merge the new attributes into the stored ones and emit the merged record.
          val newState = state.get.attributeMap ++ attributeMap
          state.update(SessionData(newState))
          Some(accountId, enrichedId, ets, newState)
        } else {
          // First event for this key: store it and emit it.
          state.update(SessionData(attributeMap))
          Some(accountId, enrichedId, ets, attributeMap)
        }
      } else {
        None
      }
    }

    val stateSpec = StateSpec.function(trackStateFunc _).timeout(Minutes(2))
    val requestsWithState = tempLines.mapWithState(stateSpec)

    requestsWithState.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // One connection per partition.
        val connection = createNewConnection()
        partitionOfRecords.foreach(record => {
          record match {
            case (accountId, enrichedId, ets, attributeMap) =>
              if (validateRecordForStorage(accountId, enrichedId, ets, attributeMap)) {
                val ds = new DBDataStore(connection)
                ds.saveEnrichedEvent(accountId, enrichedId, ets, attributeMap)
                //val r = scala.util.Random
              } else {
                /*logError("Discarded record [enrichedId=" + enrichedId + ", accountId=" + accountId + ", ets=" + ets + ", attributes=" + attributeMap.toString() + "]")*/
                println("Discarded record [enrichedId=" + enrichedId + ", accountId=" + accountId + ", ets=" + ets + "]")
                null
              }
            case default => {
              logInfo("You gave me: " + default)
              null
            }
          }
        })
      }
    }

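The mapWithState aggregation works fine. Our understanding was that it should start writing to the database only after 2 minutes, but we noticed that it starts writing immediately, without waiting for the 2 minutes. So our understanding is clearly incorrect. If someone could guide us toward our goal of writing to the database only after 2 minutes of aggregation, it would help a lot.

Please add your trackStateFunc.

@YuvalItzchakov Added trackStateFunc. Let us know if you need more information.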
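
One likely reading of the behavior described in the question: mapWithState calls the mapping function for every incoming record and passes on whatever it returns, so a function that returns `Some(...)` per event produces a write per event. The `timeout(Minutes(2))` setting only controls when the state of a key that has been idle for 2 minutes is dropped. A minimal sketch of one way to emit only at that point follows. It assumes simplified, hypothetical types (`Event`, `Aggregate`) in place of `MetricTypes.EnrichedKeyType` and `SessionData`, and the name `emitOnTimeout` is illustrative.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec, Time}

object EmitOnTimeoutSketch {
  // Hypothetical stand-ins for the application's own event and state types.
  final case class Event(accountId: String, ets: Long, attributes: Map[String, String])
  final case class Aggregate(accountId: String, ets: Long, attributes: Map[String, String])

  def emitOnTimeout(
      batchTime: Time,
      traceId: String,
      event: Option[Event],
      state: State[Aggregate]): Option[(String, Aggregate)] = {
    if (state.isTimingOut()) {
      // The key has been idle for the timeout period: emit the final aggregate.
      // The state must not be updated or removed here; Spark discards it afterwards.
      state.getOption().map(agg => (traceId, agg))
    } else {
      // A new event arrived: merge it into the stored aggregate.
      event.foreach { e =>
        val merged = state.getOption() match {
          case Some(agg) => agg.copy(ets = e.ets, attributes = agg.attributes ++ e.attributes)
          case None      => Aggregate(e.accountId, e.ets, e.attributes)
        }
        state.update(merged)
      }
      // Returning None means nothing is passed downstream for this record.
      None
    }
  }

  val stateSpec = StateSpec.function(emitOnTimeout _).timeout(Minutes(2))
}
```

With this spec, `tempLines.mapWithState(EmitOnTimeoutSketch.stateSpec)` would only pass timed-out aggregates to the existing `foreachRDD` write logic, provided `tempLines` is keyed by traceid and carries `Event` values. Two caveats apply: the timeout counts from the last event seen for a key, not from the first, and mapWithState requires checkpointing to be enabled. If fixed 2-minute windows are what is wanted, `reduceByKeyAndWindow` with a 2-minute window and slide interval may be a simpler fit.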