
Apache Spark: how to do real-time inference with a Keras model in Spark?


Here is a snippet of my code:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import Model
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType


@pandas_udf(StringType())
def online_predict(values: pd.Series) -> pd.Series:
    # Rebuild the Keras model on the worker from the broadcast config and weights
    pred = Model.from_config(bc_config.value)
    pred.set_weights(bc_weights.value)
    # preprocessing and batch_size are defined elsewhere in my code (not shown)
    ds = tf.data.Dataset.from_tensor_slices(values)
    ds = ds.map(preprocessing).batch(batch_size)
    res = pred.predict(ds)
    res = tf.norm(res, axis=1)
    # res = tf.greater(res, 5.0)
    res = tf.strings.as_string(res).numpy()
    return pd.Series(res)


spark = SparkSession.builder.appName(
    'spark_tf').master("local[*]").getOrCreate()

# Load the exported model config/weights and broadcast them to the executors
weights = np.load('./ext/weights.npy', allow_pickle=True)
config = np.load('./ext/config.npy', allow_pickle=True).item()
bc_weights = spark.sparkContext.broadcast(weights)
bc_config = spark.sparkContext.broadcast(config)

# Structured Streaming source: consume the 'dlpred' Kafka topic
stream = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .option('subscribe', 'dlpred') \
    .load()

# Run inference on the Kafka 'value' column with the pandas UDF
stream = stream.select(online_predict(col('value')).alias('value'))

# Kafka sink: write the predictions back out to the 'dltest' topic
x = stream.writeStream \
    .format('kafka') \
    .option("kafka.bootstrap.servers", 'localhost:9092') \
    .option('topic', 'dltest') \
    .option('checkpointLocation', './kafka_checkpoint') \
    .start()

x.awaitTermination()
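For completeness: the two .npy files loaded above are created offline from an already-trained Keras model, roughly along the lines of the sketch below (here "model" is just a placeholder for the trained model; the actual export step is not part of the streaming job shown above):

import numpy as np
# Offline export (sketch): save the trained model's architecture and weights
# so the streaming job can rebuild it with Model.from_config / set_weights.
np.save('./ext/config.npy', model.get_config())
np.save('./ext/weights.npy', np.array(model.get_weights(), dtype=object))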
So my workflow is basically:

  • broadcast the model's weights and config
  • set up a PySpark Structured Streaming pipeline reading from Kafka, then apply the pandas UDF to it
  • send the messages back to Kafka through the PySpark Kafka sink

Is this good practice? I initialize the model inside the pandas UDF because the pandas UDF is what the Spark cluster actually runs, so initializing the model outside of it, even with the broadcast weights and config, seemed pointless to me, since the cluster would not distribute the model itself to its workers.

As far as I know, PySpark applies the UDF each time new rows arrive, so the model initialization gets repeated every time, doesn't it? I also keep getting warnings as messages come in. Overall, I have very little experience with Structured Streaming and Spark, so I am not sure whether this is implemented correctly.
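For comparison, this is roughly what I imagine caching the model per Python worker would look like, so that it is only rebuilt on the first batch a worker handles instead of on every micro-batch. This is just an untested sketch of the pattern; whether the cached object actually survives across batches depends on how Spark reuses its Python workers, which is exactly what I am unsure about:

_model = None

@pandas_udf(StringType())
def online_predict(values: pd.Series) -> pd.Series:
    # Lazily rebuild the model once per worker process and reuse it afterwards
    global _model
    if _model is None:
        _model = Model.from_config(bc_config.value)
        _model.set_weights(bc_weights.value)
    ds = tf.data.Dataset.from_tensor_slices(values)
    ds = ds.map(preprocessing).batch(batch_size)
    res = tf.norm(_model.predict(ds), axis=1)
    return pd.Series(tf.strings.as_string(res).numpy())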