Apache Spark Streaming: How to load a pipeline on a stream?

Tags: apache-spark, pyspark, spark-streaming, apache-spark-mllib

I am implementing a lambda architecture system for stream processing.

I had no problem creating the pipeline with GridSearch in Spark Batch:

pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor])

paramGrid = (
    ParamGridBuilder()
    .addGrid(logistic_regressor.regParam, (0.01, 0.1))
    .addGrid(logistic_regressor.tol, (1e-5, 1e-6))
    ...etcetera
).build()

cv = CrossValidator(estimator=pipeline,
                estimatorParamMaps=paramGrid,
                evaluator=BinaryClassificationEvaluator(),
                numFolds=4)

pipeline_cv = cv.fit(raw_train_df)
model_fitted = pipeline_cv.getEstimator().fit(raw_validation_df)
model_fitted.write().overwrite().save("pipeline")
However, I cannot seem to find how to plug the pipeline into the Spark Streaming process. I am using Kafka as the DStream source, and my code currently looks as follows:

import json
from pyspark.ml import PipelineModel
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"kafka_topic": 1})

model = PipelineModel.load('pipeline/')
parsed_stream = kafkaStream.map(lambda x: json.loads(x[1]))

CODE MISSING GOES HERE    

ssc.start()
ssc.awaitTermination()
Now I need to figure out what goes in the missing part: how to actually apply the loaded pipeline to the parsed stream.

According to the documentation (even though it seems quite outdated), your model needs to implement the method predict to be usable on an RDD (and, hopefully, on a KafkaStream?).
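For reference, that RDD-based approach would look roughly like the sketch below. This is only an illustration, assuming a plain pyspark.mllib LogisticRegressionModel saved under a hypothetical "mllib_model" path and a hypothetical to_vector() helper, neither of which is part of my pipeline above:

from pyspark.mllib.classification import LogisticRegressionModel
from pyspark.mllib.linalg import Vectors

# hypothetical mllib model, not the ML PipelineModel saved above
mllib_model = LogisticRegressionModel.load(sc, "mllib_model")

def to_vector(record):
    # hypothetical feature extraction from each parsed JSON record
    return Vectors.dense([record["some_numeric_field"]])

# mllib's predict() accepts an RDD of vectors, so it can be called inside transform()
predictions = parsed_stream.map(to_vector).transform(lambda rdd: mllib_model.predict(rdd))
predictions.pprint()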

How could I use the pipeline in a Streaming context? The reloaded PipelineModel only seems to implement transform.


Does that mean the only way to use batch-trained models in a Streaming context is to use plain models, and no pipelines?

I found a way to load a Spark pipeline into Spark Streaming.

This solution works for Spark v2.0; later versions will probably implement a better solution.

The solution I found converts the streaming RDDs to DataFrames using the toDF() method, on which the pipeline.transform method can then be applied.

This way of doing things is terribly inefficient, though.

# We load the required libraries
from pyspark.ml import PipelineModel
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType
)
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# We specify the DataFrame schema up front, so Spark does not have to infer it from the data.

pipeline_schema = StructType(
    [
        StructField("field1", StringType(), True),
        StructField("field2", StringType(), True),
        StructField("field3", LongType(), True)
    ]
)

# We load the pipeline saved by the Spark batch job
pipeline = PipelineModel.load('/pipeline')

# Set up the usual Spark context and Spark Streaming context
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# In my case I use a Kafka direct stream as the DStream source
directKafkaStream = KafkaUtils.createDirectStream(ssc, [QUEUE_NAME], {"metadata.broker.list": "localhost:9092"})

def handler(req_rdd):
    def process_point(p):
        # Here goes the logic to run on each point after applying the pipeline
        print(p)
    if req_rdd.count() > 0:
        # Here is the gist of it: we turn the RDD into Rows, then into a DataFrame with the specified schema
        req_df = req_rdd.map(lambda r: Row(**r)).toDF(schema=pipeline_schema)
        # Now we can apply the transform, yaaay
        pred = pipeline.transform(req_df)
        records = pred.rdd.map(lambda p: process_point(p)).collect()
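
Finally, to actually run this, the handler still has to be registered on the DStream and the streaming context started; a minimal sketch, using the same names as above:

# Hook the handler into the stream and start processing
directKafkaStream.foreachRDD(handler)

ssc.start()
ssc.awaitTermination()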
Hope this helps.