PySpark + Kafka streaming data processing

I am using Spark 2.3.2 with PySpark and have just discovered that foreach and foreachBatch are not available on the DataStreamWriter object in this setup. The problem is that the company's Hadoop is 2.6, and Spark 2.4 (which provides what I need) does not work there (the SparkSession crashes). Is there another way to send the data to a custom handler and process the streaming data?
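For context, what I want to pass as customHandler is the kind of row-handler object that DataStreamWriter.foreach() accepts from Python in Spark 2.4+, i.e. something with a process() method and optional open()/close(). The class below is only an illustrative sketch of that contract (the name and body are mine, not real code from the job):

    class CustomRowHandler:
        def open(self, partition_id, epoch_id):
            # set up any per-partition resources (e.g. an HTTP or DB connection)
            return True

        def process(self, row):
            # push each parsed record to the downstream system
            print(row.kafka_key, row.pjson)

        def close(self, error):
            # release whatever was opened in open()
            pass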

Here is the code I have so far:

    def streamLoad(self, customHandler):
        # col and from_json below assume: from pyspark.sql.functions import col, from_json
        options = self.options

        self.logger.info("Recovering the schema based on the JSON structure")
        jsonStrings = ['{"sku":"9","ean":"4","name":"DVD","description":"foo description","categories":[{"code":"M02_BLURAY_E_DVD_PLAYER"}],"attributes":[{"name":"attrTeste","value":"Teste"}]}']
        myRDD = self.spark.sparkContext.parallelize(jsonStrings)
        jsonSchema = self.spark.read.json(myRDD).schema  # Maybe there is a way to serialize this

        self.logger.info("Starting the Kafka stream [options: {}]".format(str(options)))
        df = self.spark \
            .readStream \
            .format("kafka") \
            .option("maxFilesPerTrigger", 1) \
            .option("kafka.bootstrap.servers", options["kafka.bootstrap.servers"]) \
            .option("startingOffsets", options["startingOffsets"]) \
            .option("subscribe", options["subscribe"]) \
            .option("failOnDataLoss", options["failOnDataLoss"]) \
            .load() \
            .select(
                col('value').cast("string").alias('json'),
                col('key').cast("string").alias('kafka_key'),
                col("timestamp").cast("string").alias('kafka_timestamp')
            ) \
            .withColumn('pjson', from_json(col('json'), jsonSchema)).drop('json')

        # .foreach() is exactly what I need here, but it is not available for Python in Spark 2.3.x -- alternatives, please?
        query = df \
            .writeStream \
            .foreach(customHandler) \
            .start()

        query.awaitTermination()
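The closest alternative I can think of is falling back to the legacy DStream API (spark-streaming-kafka-0-8), which does let Python code see each record in Spark 2.3 through foreachRDD. The sketch below is just an idea, not something I have tested against our cluster: the topic and broker names are placeholders, customHandler is assumed here to be a plain picklable function so it can run on the executors, and the spark-streaming-kafka-0-8 package would have to be on the classpath. Would something like this be a reasonable workaround, or is there a better way to do it with Structured Streaming on 2.3?

    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    def process_partition(records):
        # records is an iterator of (key, value) string tuples from Kafka
        for key, value in records:
            customHandler(value)  # hand each raw JSON string to the custom handler

    ssc = StreamingContext(spark.sparkContext, 5)  # 5-second micro-batches
    stream = KafkaUtils.createDirectStream(
        ssc,
        ["my-topic"],                              # placeholder topic
        {"metadata.broker.list": "broker1:9092"})  # placeholder brokers

    stream.foreachRDD(lambda rdd: rdd.foreachPartition(process_partition))

    ssc.start()
    ssc.awaitTermination()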