PySpark Kafka streaming with a custom data handler
I'm using Spark 2.3.2 with PySpark and just found out that foreach and foreachBatch are not available on the DataStreamWriter object in this configuration. The problem is that the company's Hadoop is 2.6, and Spark 2.4 (which provides what I need) doesn't work there (the SparkSession crashes). Is there another way to send the data to a custom handler and process the streaming data?

Here is my code so far:
from pyspark.sql.functions import col, from_json

def streamLoad(self, customHandler):
    options = self.options

    self.logger.info("Deriving the schema from the JSON structure")
    jsonStrings = ['{"sku":"9","ean":"4","name":"DVD","description":"foo description","categories":[{"code":"M02_BLURAY_E_DVD_PLAYER"}],"attributes":[{"name":"attrTeste","value":"Teste"}]}']
    myRDD = self.spark.sparkContext.parallelize(jsonStrings)
    jsonSchema = self.spark.read.json(myRDD).schema  # Maybe there is a way to serialize this

    self.logger.info("Starting the Kafka stream [options: {}]".format(str(options)))
    df = self.spark \
        .readStream \
        .format("kafka") \
        .option("maxFilesPerTrigger", 1) \
        .option("kafka.bootstrap.servers", options["kafka.bootstrap.servers"]) \
        .option("startingOffsets", options["startingOffsets"]) \
        .option("subscribe", options["subscribe"]) \
        .option("failOnDataLoss", options["failOnDataLoss"]) \
        .load() \
        .select(
            col('value').cast("string").alias('json'),
            col('key').cast("string").alias('kafka_key'),
            col("timestamp").cast("string").alias('kafka_timestamp')
        ) \
        .withColumn('pjson', from_json(col('json'), jsonSchema)).drop('json')

    query = df \
        .writeStream \
        .foreach(customHandler) \
        .start()  # This doesn't work in Spark 2.3.x -- alternatives, please?
    query.awaitTermination()
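Since the Structured Streaming foreach sink is not exposed to Python until Spark 2.4, one workaround on 2.3 is to drop down to the older DStream API, which does let you hand each micro-batch to a custom handler through foreachRDD. Below is a minimal sketch, not a drop-in replacement: the handle_records helper and the handler signature are assumptions, and the commented wiring reuses the options keys from the question (it requires the spark-streaming-kafka-0-8 package, which still ships for Spark 2.3).

```python
import json

def handle_records(records, handler):
    """Parse each Kafka (key, value) pair as JSON and pass it to the handler."""
    for key, value in records:
        handler(key, json.loads(value))

# Spark wiring (sketch; assumes a running SparkSession and the
# spark-streaming-kafka-0-8 artifact on the classpath):
#
# from pyspark.streaming import StreamingContext
# from pyspark.streaming.kafka import KafkaUtils
#
# ssc = StreamingContext(self.spark.sparkContext, 5)  # 5-second micro-batches
# stream = KafkaUtils.createDirectStream(
#     ssc, [options["subscribe"]],
#     {"metadata.broker.list": options["kafka.bootstrap.servers"]})
# # collect() pulls each micro-batch to the driver, which is fine for small
# # batches; for larger ones, move the handler logic into rdd.foreachPartition.
# stream.foreachRDD(lambda rdd: handle_records(rdd.collect(), customHandler))
# ssc.start()
# ssc.awaitTermination()
```

Note that the DStream API gives you raw (key, value) pairs rather than a DataFrame, so the from_json schema step would move into the handler (e.g. json.loads as above).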