Apache Spark: deserializing Spark Structured Streaming data from a Kafka topic
I am using Kafka 2.3.0 and Spark 2.3.4. I have built a Kafka connector that reads a CSV file and posts each line of the CSV to the relevant Kafka topic. A line looks like this: "201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect". The CSV contains 1000 such lines. The connector successfully publishes them to the topic, and I am able to receive the messages in Spark. I am not sure how to deserialize each message into my schema. Note that the messages are headerless, so the key part of each Kafka message is null; the value part contains the complete CSV string as shown above.
I have looked at an example of this but could not port it to my CSV case. I also tried other Spark SQL mechanisms to retrieve the individual fields from the "value" column, to no avail. When I do manage to get a compiling version (e.g., a map over the indivValues dataset or over dsRawData), I get errors similar to: "org.apache.spark.sql.AnalysisException: cannot resolve 'IC' given input columns: [value];". If I understand correctly, this is because value is a comma-separated string and Spark will not magically map it to my schema unless I do "something". My code is below:
//build the spark session
SparkSession sparkSession = SparkSession.builder()
.appName(seCfg.arg0AppName)
.config("spark.cassandra.connection.host",config.arg2CassandraIp)
.getOrCreate();
...
//my target schema is this:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeOfOrigin", DataTypes.TimestampType, true),
DataTypes.createStructField("cName", DataTypes.StringType, true),
DataTypes.createStructField("cRole", DataTypes.StringType, true),
DataTypes.createStructField("bName", DataTypes.StringType, true),
DataTypes.createStructField("stage", DataTypes.StringType, true),
DataTypes.createStructField("intId", DataTypes.IntegerType, true),
DataTypes.createStructField("intName", DataTypes.StringType, true),
DataTypes.createStructField("intCatId", DataTypes.IntegerType, true),
DataTypes.createStructField("catName", DataTypes.StringType, true),
DataTypes.createStructField("are_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("isee_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("opCode", DataTypes.IntegerType, true),
DataTypes.createStructField("opType", DataTypes.StringType, true),
DataTypes.createStructField("opName", DataTypes.StringType, true)
});
...
Dataset<Row> dsRawData = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
.option("subscribe", config.arg1TopicName)
.option("failOnDataLoss", "false")
.load();
//getting individual terms like '201310', 'XYZ001'.. from "values"
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING())
.flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(",")).iterator(), Encoders.STRING());
//indivValues when printed to console looks like below which confirms that //I receive the data correctly and completely
/*
When printed on console, looks like this:
+--------------------+
| value|
+--------------------+
| 201310|
| XYZ001|
| Sup|
| XYZ|
| A|
| 0|
| Presales|
| 6|
| Callout|
| 0|
| 0|
| 1|
| N|
| Prospect|
+--------------------+
*/
StreamingQuery sq = indivValues.writeStream()
.outputMode("append")
.format("console")
.start();
//await termination
sq.awaitTermination();
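A side note on the flatMap above: it splits each message into individual tokens, which is why the console shows one field per row rather than one message per row. Independent of Spark, Java's String.split also silently drops trailing empty fields unless a negative limit is passed, which matters if a CSV line can end with empty columns. A minimal plain-Java sketch using the sample line from the question (no Spark involved):

```java
public class CsvLineSplitDemo {
    public static void main(String[] args) {
        String value = "201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect";

        // limit -1 preserves trailing empty fields, e.g. for lines ending in ",,"
        String[] fields = value.split(",", -1);

        System.out.println(fields.length);  // 14 fields in the sample line
        System.out.println(fields[0]);      // first token: 201310
        System.out.println(fields[13]);     // last token: Prospect
    }
}
```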
- I need the data typed to the custom schema shown above, because I will run mathematical calculations over it (for each new row combined with some older rows).
- Would it be better to synthesize headers in the Kafka connector source task before pushing lines to the topic? Would having headers make solving this simpler?

Thanks.

I have since been able to solve this using Spark SQL. The code for the solution is below.
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
//create new columns, parse out the orig message and fill column with the values
Dataset<Row> dataAsSchema2 = indivValues
.selectExpr("value",
"split(value,',')[0] as time",
"split(value,',')[1] as cname",
"split(value,',')[2] as crole",
"split(value,',')[3] as bname",
"split(value,',')[4] as stage",
"split(value,',')[5] as intid",
"split(value,',')[6] as intname",
"split(value,',')[7] as intcatid",
"split(value,',')[8] as catname",
"split(value,',')[9] as are_vval",
"split(value,',')[10] as isee_vval",
"split(value,',')[11] as opcode",
"split(value,',')[12] as optype",
"split(value,',')[13] as opname")
.drop("value");
//remove any whitespaces as they interfere with data type conversions
dataAsSchema2 = dataAsSchema2
.withColumn("intid", functions.regexp_replace(functions.col("intid"),
" ", ""))
.withColumn("intcatid", functions.regexp_replace(functions.col("intcatid"),
" ", ""))
.withColumn("are_vval", functions.regexp_replace(functions.col("are_vval"),
" ", ""))
.withColumn("isee_vval", functions.regexp_replace(functions.col("isee_vval"),
" ", ""))
.withColumn("opcode", functions.regexp_replace(functions.col("opcode"),
" ", ""));
//change types to ready for calc
dataAsSchema2 = dataAsSchema2
.withColumn("intcatid",functions.col("intcatid").cast(DataTypes.IntegerType))
.withColumn("intid",functions.col("intid").cast(DataTypes.IntegerType))
.withColumn("are_vval",functions.col("are_vval").cast(DataTypes.IntegerType))
.withColumn("isee_vval",functions.col("isee_vval").cast(DataTypes.IntegerType))
.withColumn("opcode",functions.col("opcode").cast(DataTypes.IntegerType));
//build a typed POJO dataset by applying a bean encoder to the schema-shaped rows
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> pjClass = dataAsSchema2.as(encoder);
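The regexp_replace plus cast chain above does per field what plain Java would do with a whitespace strip followed by integer parsing. A minimal sketch of that per-field conversion (plain Java, no Spark; the helper name is hypothetical):

```java
public class FieldConversionDemo {
    // Mirrors regexp_replace(col, " ", "") followed by cast(IntegerType):
    // strip embedded spaces, then parse; like Spark's cast, return null
    // when nothing parseable remains.
    static Integer toIntField(String raw) {
        String cleaned = raw.replace(" ", "");
        if (cleaned.isEmpty()) {
            return null;
        }
        try {
            return Integer.valueOf(cleaned);
        } catch (NumberFormatException e) {
            return null; // Spark's cast also yields null for non-numeric strings
        }
    }

    public static void main(String[] args) {
        System.out.println(toIntField(" 6 "));  // 6
        System.out.println(toIntField("N"));    // null
    }
}
```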