Java: How do I update a broadcast variable in Spark Streaming?
I have what I believe is a relatively common use case for Spark Streaming: I have a stream of objects that I would like to filter based on some reference data. Initially I thought this would be very simple to achieve with a broadcast variable:
public void startSparkEngine() {
    Broadcast<ReferenceData> refdataBroadcast
        = sparkContext.broadcast(getRefData());

    final JavaDStream<MyObject> filteredStream = objectStream.filter(obj -> {
        final ReferenceData refData = refdataBroadcast.getValue();
        return obj.getField().equals(refData.getField());
    });

    filteredStream.foreachRDD(rdd -> {
        rdd.foreach(obj -> {
            // Final processing of filtered objects
        });
        return null;
    });
}
However, my reference data changes periodically, albeit infrequently. I was under the impression that I could modify and re-broadcast the variable on the driver and have it propagated to each of the workers, but the Broadcast object is not Serializable and needs to be final.
What are my options? The solutions I can think of are:
- Move the reference data lookup into a forEachPartition or forEachRdd so that it resides entirely on the workers. However, the reference data lives behind a REST API, so I would also need to store some kind of timer/counter to stop the remote endpoint being hit for every element in the stream.
- Convert the reference data to an RDD and join the streams so that I am now streaming pairs of (MyObject, ReferenceData), although this would ship the reference data with every object (a rough sketch of this option follows below).
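For what it is worth, here is a minimal sketch of that last option; keying both sides on getField(), and assuming it returns a String, are illustrative choices rather than anything from the original question (it needs scala.Tuple2, java.util.Collections, JavaPairRDD and JavaPairDStream imports):

// Sketch only: key the stream and the reference data on the field being compared,
// then join them per micro-batch, producing (key, (MyObject, ReferenceData)) pairs.
JavaPairRDD<String, ReferenceData> refDataRdd = sparkContext
        .parallelize(Collections.singletonList(getRefData()))
        .mapToPair(ref -> new Tuple2<>(ref.getField(), ref));

JavaPairDStream<String, MyObject> keyedStream =
        objectStream.mapToPair(obj -> new Tuple2<>(obj.getField(), obj));

JavaPairDStream<String, Tuple2<MyObject, ReferenceData>> joined =
        keyedStream.transformToPair(rdd -> rdd.join(refDataRdd));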
Not sure if you have already tried this, but I think an update to a broadcast variable can be achieved without shutting down the SparkContext. Through the unpersist() method, the copies of the broadcast variable are deleted on each executor, and the variable needs to be re-broadcast in order to be accessed again. For your use case, when you want to update the broadcast you can unpersist the old broadcast variable, fetch the updated reference data, and broadcast it again.
I took a lot of this from a previous answer, but the person who made the last reply there claimed to have it working locally. It is important to note that you probably want to set blocking to true on the unpersist, so that you can be sure the executors are rid of the old data (and the stale values are not read again on the next iteration).
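For illustration, a minimal driver-side sketch of that step, reusing the names from the question (refdataBroadcast would of course have to stop being final), might be:

// Sketch only: drop the executors' copies (blocking until they are gone), then re-broadcast.
refdataBroadcast.unpersist(true);
refdataBroadcast = sparkContext.broadcast(getRefData());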
Pretty much everyone dealing with streaming applications needs a way to weave reference data (from a database, files, etc.) into the streaming data, for filtering, lookups and so on. We have a partial solution to the two parts of this problem:
- create a CacheLookup object with the desired cache TTL
- wrap it in a Broadcast
- use the CacheLookup as part of the streaming logic (a rough sketch of such an object follows this list)
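The answer above does not show the CacheLookup object itself, so the following is only a sketch of what it might look like, assuming the question's ReferenceData class and a hypothetical fetchRefDataFromRest() helper standing in for the actual REST call:

// Sketch only: a serializable, TTL-based lookup that each executor consults lazily.
public class CacheLookup implements java.io.Serializable {

    private final long ttlMillis;
    private transient ReferenceData cached;   // per-executor copy, never shipped from the driver
    private transient long lastFetchedAt;     // 0 after deserialization, so the first call fetches

    public CacheLookup(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Called from the streaming logic on the executors; hits the REST endpoint
    // at most once per TTL window instead of once per element.
    public synchronized ReferenceData get() {
        long now = System.currentTimeMillis();
        if (cached == null || now - lastFetchedAt > ttlMillis) {
            cached = fetchRefDataFromRest();
            lastFetchedAt = now;
        }
        return cached;
    }

    private ReferenceData fetchRefDataFromRest() {
        // placeholder for your actual REST client call
        return new ReferenceData();
    }
}

Broadcasting a single new CacheLookup(60000) on the driver and calling get() inside the filter would then refresh each executor's copy at most once per minute.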
The second part of the question is still not solved, though, and I would like to know whether there is a workable approach to it.

Extending @Rohan Aletty's answer, here is sample code for a BroadcastWrapper that refreshes the broadcast variable based on some TTL:
import java.util.Calendar;
import java.util.Date;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastWrapper {

    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();

    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        return jsc;
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // let's say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());

            // Your logic to refresh the reference data
            ReferenceData data = getRefData();
            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}
Your code would look like this:
public void startSparkEngine() {

    final JavaDStream<MyObject> filteredStream = objectStream.transform(stream -> {
        Broadcast<ReferenceData> refdataBroadcast =
            BroadcastWrapper.getInstance().updateAndGet(stream.context());
        return stream.filter(obj -> obj.getField().equals(refdataBroadcast.getValue().getField()));
    });

    filteredStream.foreachRDD(rdd -> {
        rdd.foreach(obj -> {
            // Final processing of filtered objects
        });
        return null;
    });
}
This also worked for me on multi-cluster setups. Hope this helps.

I recently faced this issue and thought it might be helpful for Scala users: the Scala way of implementing a BroadcastWrapper is like the example below.
import java.io.{ ObjectInputStream, ObjectOutputStream }
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext
import scala.reflect.ClassTag
/* wrapper lets us update broadcast variables within DStreams' foreachRDD
   without running into serialization issues */
case class BroadcastWrapper[T: ClassTag](
@transient private val ssc: StreamingContext,
@transient private val _v: T) {
@transient private var v = ssc.sparkContext.broadcast(_v)
def update(newValue: T, blocking: Boolean = false): Unit = {
v.unpersist(blocking)
v = ssc.sparkContext.broadcast(newValue)
}
def value: T = v.value
private def writeObject(out: ObjectOutputStream): Unit = {
out.writeObject(v)
}
private def readObject(in: ObjectInputStream): Unit = {
v = in.readObject().asInstanceOf[Broadcast[T]]
}
}
Every time you need to change the broadcast value, call update on the wrapper with the new reference data.
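With Structured Streaming, a different way to achieve the same effect is to re-read the reference (dimension) data inside foreachBatch, so that every micro-batch joins against the latest copy; the following example does exactly that: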
package com.databroccoli.streaming.dimensionupateinstreaming
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.{broadcast, expr}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
object RefreshDimensionInStreaming {
def main(args: Array[String]) = {
@transient lazy val logger: Logger = Logger.getLogger(getClass.getName)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
Logger.getLogger("io.netty").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
val schemaUntyped1 = StructType(
Array(
StructField("id", StringType),
StructField("customrid", StringType),
StructField("customername", StringType),
StructField("countrycode", StringType),
StructField("timestamp_column_fin_1", TimestampType)
))
val schemaUntyped2 = StructType(
Array(
StructField("id", StringType),
StructField("countrycode", StringType),
StructField("countryname", StringType),
StructField("timestamp_column_fin_2", TimestampType)
))
val factDf1 = spark.readStream
.schema(schemaUntyped1)
.option("header", "true")
.csv("src/main/resources/broadcasttest/fact")
    var countryDf: Option[DataFrame] = None

    // Re-read the dimension data and replace the cached countryDf, unpersisting the old copy
    def updateDimensionDf() = {
val dimDf2 = spark.read
.schema(schemaUntyped2)
.option("header", "true")
.csv("src/main/resources/broadcasttest/dimension")
if (countryDf != None) {
countryDf.get.unpersist()
}
countryDf = Some(
dimDf2
.withColumnRenamed("id", "id_2")
.withColumnRenamed("countrycode", "countrycode_2"))
countryDf.get.show()
}
factDf1.writeStream
.outputMode("append")
      // For every micro-batch: refresh the dimension DataFrame, then left-join it to the batch
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.show(10)
updateDimensionDf()
batchDF
.join(
countryDf.get,
expr(
"""
countrycode_2 = countrycode
"""
),
"leftOuter"
)
.show
}
.start()
.awaitTermination()
}
}