Apache Spark: deserializing Spark Structured Streaming data from a Kafka topic
I am using Kafka 2.3.0 and Spark 2.3.4. I have built a Kafka connector that reads a CSV file and posts each line of the CSV to the relevant Kafka topic. A line looks like this: "201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect". The CSV contains 1000 such lines. The connector successfully publishes them to the topic, and I am able to receive the messages in Spark. I am not sure how to deserialize each message into my schema. Note that the messages are headerless, so the key part of each Kafka message is null; the value part contains the complete CSV string as shown above.
I have looked at an example of this but could not port it to my CSV case. I also tried other Spark SQL mechanisms to retrieve the individual fields from the "value" column, to no avail. When I do manage to get a compiling version (e.g., a map over the indivValues dataset or over dsRawData), I get errors similar to: "org.apache.spark.sql.AnalysisException: cannot resolve 'IC' given input columns: [value];". If I understand correctly, this is because value is a comma-separated string and Spark will not magically map it to my schema unless I do "something". My code is below:
//build the spark session
SparkSession sparkSession = SparkSession.builder()
.appName(seCfg.arg0AppName)
.config("spark.cassandra.connection.host",config.arg2CassandraIp)
.getOrCreate();
...
//my target schema is this:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeOfOrigin", DataTypes.TimestampType, true),
DataTypes.createStructField("cName", DataTypes.StringType, true),
DataTypes.createStructField("cRole", DataTypes.StringType, true),
DataTypes.createStructField("bName", DataTypes.StringType, true),
DataTypes.createStructField("stage", DataTypes.StringType, true),
DataTypes.createStructField("intId", DataTypes.IntegerType, true),
DataTypes.createStructField("intName", DataTypes.StringType, true),
DataTypes.createStructField("intCatId", DataTypes.IntegerType, true),
DataTypes.createStructField("catName", DataTypes.StringType, true),
DataTypes.createStructField("are_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("isee_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("opCode", DataTypes.IntegerType, true),
DataTypes.createStructField("opType", DataTypes.StringType, true),
DataTypes.createStructField("opName", DataTypes.StringType, true)
});
...
Dataset<Row> dsRawData = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
.option("subscribe", config.arg1TopicName)
.option("failOnDataLoss", "false")
.load();
//getting individual terms like '201310', 'XYZ001'.. from "values"
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING())
.flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(",")).iterator(), Encoders.STRING());
//indivValues when printed to console looks like below which confirms that //I receive the data correctly and completely
/*
When printed on console, looks like this:
+--------------------+
| value|
+--------------------+
| 201310|
| XYZ001|
| Sup|
| XYZ|
| A|
| 0|
| Presales|
| 6|
| Callout|
| 0|
| 0|
| 1|
| N|
| Prospect|
+--------------------+
*/
StreamingQuery sq = indivValues.writeStream()
.outputMode("append")
.format("console")
.start();
//await termination
sq.awaitTermination();
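A side note on the flatMap above: it splits each message into individual tokens, which is why the console shows one field per row rather than one message per row. Independent of Spark, Java's String.split also silently drops trailing empty fields unless a negative limit is passed, which matters if a CSV line can end with empty columns. A minimal plain-Java sketch using the sample line from the question (no Spark involved):

```java
public class CsvLineSplitDemo {
    public static void main(String[] args) {
        String value = "201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect";

        // limit -1 preserves trailing empty fields, e.g. for lines ending in ",,"
        String[] fields = value.split(",", -1);

        System.out.println(fields.length);  // 14 fields in the sample line
        System.out.println(fields[0]);      // first token: 201310
        System.out.println(fields[13]);     // last token: Prospect
    }
}
```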
- I need the data typed to the custom schema shown above, because I will run mathematical calculations over it (for each new row combined with some older rows).
- Would it be better to synthesize headers in the Kafka connector source task before pushing lines to the topic? Would having headers make solving this simpler?

Thanks.

I have since been able to solve this using Spark SQL. The code for the solution is below.
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
//create new columns, parse out the orig message and fill column with the values
Dataset<Row> dataAsSchema2 = indivValues
.selectExpr("value",
"split(value,',')[0] as time",
"split(value,',')[1] as cname",
"split(value,',')[2] as crole",
"split(value,',')[3] as bname",
"split(value,',')[4] as stage",
"split(value,',')[5] as intid",
"split(value,',')[6] as intname",
"split(value,',')[7] as intcatid",
"split(value,',')[8] as catname",
"split(value,',')[9] as are_vval",
"split(value,',')[10] as isee_vval",
"split(value,',')[11] as opcode",
"split(value,',')[12] as optype",
"split(value,',')[13] as opname")
.drop("value");
//remove any whitespaces as they interfere with data type conversions
dataAsSchema2 = dataAsSchema2
.withColumn("intid", functions.regexp_replace(functions.col("intid"),
" ", ""))
.withColumn("intcatid", functions.regexp_replace(functions.col("intcatid"),
" ", ""))
.withColumn("are_vval", functions.regexp_replace(functions.col("are_vval"),
" ", ""))
.withColumn("isee_vval", functions.regexp_replace(functions.col("isee_vval"),
" ", ""))
.withColumn("opcode", functions.regexp_replace(functions.col("opcode"),
" ", ""));
//change types to ready for calc
dataAsSchema2 = dataAsSchema2
.withColumn("intcatid",functions.col("intcatid").cast(DataTypes.IntegerType))
.withColumn("intid",functions.col("intid").cast(DataTypes.IntegerType))
.withColumn("are_vval",functions.col("are_vval").cast(DataTypes.IntegerType))
.withColumn("isee_vval",functions.col("isee_vval").cast(DataTypes.IntegerType))
.withColumn("opcode",functions.col("opcode").cast(DataTypes.IntegerType));
//build a typed POJO dataset by applying a bean encoder to the schema-shaped rows
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> pjClass = dataAsSchema2.as(encoder);
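The regexp_replace plus cast chain above does per field what plain Java would do with a whitespace strip followed by integer parsing. A minimal sketch of that per-field conversion (plain Java, no Spark; the helper name is hypothetical):

```java
public class FieldConversionDemo {
    // Mirrors regexp_replace(col, " ", "") followed by cast(IntegerType):
    // strip embedded spaces, then parse; like Spark's cast, return null
    // when nothing parseable remains.
    static Integer toIntField(String raw) {
        String cleaned = raw.replace(" ", "");
        if (cleaned.isEmpty()) {
            return null;
        }
        try {
            return Integer.valueOf(cleaned);
        } catch (NumberFormatException e) {
            return null; // Spark's cast also yields null for non-numeric strings
        }
    }

    public static void main(String[] args) {
        System.out.println(toIntField(" 6 "));  // 6
        System.out.println(toIntField("N"));    // null
    }
}
```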