
Apache Spark: Deserializing Spark Structured Streaming data from a Kafka topic


I am using Kafka 2.3.0 and Spark 2.3.4. I have built a Kafka connector that reads a CSV file and sends each line of the CSV to the relevant Kafka topic. A line looks like this: "201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect". The CSV contains 1000 such lines. The connector publishes them onto the topic successfully, and I am also able to receive the messages in Spark. I am not sure how to deserialize that message into my schema. Note that the messages are headerless, so the key part of the Kafka message is null; the value part contains the complete CSV string as shown above. My code is shown further below.
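For reference (from the Structured Streaming + Kafka integration guide), every record read by Spark's Kafka source arrives with a fixed set of columns, and the CSV payload sits in the binary value column; that is why it has to be cast to a string before any field can be parsed out. A minimal sketch against the dsRawData stream defined in the code further below:

//Columns exposed by the Kafka source for every record:
//  key (binary)        -- null here, since the connector sends no key
//  value (binary)      -- the full CSV line, e.g. "201310,XYZ001,Sup,..."
//  topic (string), partition (int), offset (long),
//  timestamp (timestamp), timestampType (int)
dsRawData.printSchema();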

I have looked at this, but could not port it to my CSV case. I have also tried other Spark SQL mechanisms to retrieve the individual rows from the "value" column, to no effect. If I do manage to get a version that compiles (for example, a map over the indivValues dataset or over dsRawData), I get errors similar to: "org.apache.spark.sql.AnalysisException: cannot resolve 'IC' given input columns: [value];". If I understand it correctly, that is because value is a comma-separated string and Spark is not going to magically map it for me without me doing "something".
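To make that failure concrete, here is a minimal sketch (the field name intid is only an example); it uses the dsRawData stream defined in the code below. Until new columns are derived from value, the Dataset has exactly one column, so any direct reference to a CSV field cannot be resolved.

//After the cast there is only one column, "value".
Dataset<Row> onlyValue = dsRawData.selectExpr("CAST(value AS STRING) AS value");
onlyValue.printSchema();        // root |-- value: string (nullable = true)
//onlyValue.select("intid");    // would fail at analysis time:
//                              // AnalysisException: cannot resolve 'intid' given input columns: [value]
//The CSV fields only become addressable once they are derived explicitly,
//for example with split(value, ','), as in the accepted approach further down.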

//build the spark session
SparkSession sparkSession = SparkSession.builder()
    .appName(seCfg.arg0AppName)
    .config("spark.cassandra.connection.host",config.arg2CassandraIp)
    .getOrCreate();

...
//my target schema is this:
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("timeOfOrigin",  DataTypes.TimestampType, true),
    DataTypes.createStructField("cName", DataTypes.StringType, true),
    DataTypes.createStructField("cRole", DataTypes.StringType, true),
    DataTypes.createStructField("bName", DataTypes.StringType, true),
    DataTypes.createStructField("stage", DataTypes.StringType, true),
    DataTypes.createStructField("intId", DataTypes.IntegerType, true),
    DataTypes.createStructField("intName", DataTypes.StringType, true),
    DataTypes.createStructField("intCatId", DataTypes.IntegerType, true),
    DataTypes.createStructField("catName", DataTypes.StringType, true),
    DataTypes.createStructField("are_vval", DataTypes.IntegerType, true),
    DataTypes.createStructField("isee_vval", DataTypes.IntegerType, true),
    DataTypes.createStructField("opCode", DataTypes.IntegerType, true),
    DataTypes.createStructField("opType", DataTypes.StringType, true),
    DataTypes.createStructField("opName", DataTypes.StringType, true)
    });
...

 Dataset<Row> dsRawData = sparkSession
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
    .option("subscribe", config.arg1TopicName)
    .option("failOnDataLoss", "false")
    .load();

//getting individual terms like '201310', 'XYZ001', ... from "value"
Dataset<String> indivValues = dsRawData
    .selectExpr("CAST(value AS STRING)")
    .as(Encoders.STRING())
    .flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(",")).iterator(), Encoders.STRING());

//indivValues, when printed to the console, looks like the output below, which confirms that I receive the data correctly and completely
/*
When printed on console, looks like this:
                +--------------------+
                |               value|
                +--------------------+
                |              201310|
                |              XYZ001|
                |                 Sup|
                |                 XYZ|
                |                   A|
                |                   0|
                |            Presales|
                |                   6|
                |             Callout|
                |                   0|
                |                   0|
                |                   1|
                |                   N|
                |            Prospect|
                +--------------------+
*/

StreamingQuery sq = indivValues.writeStream()
    .outputMode("append")
    .format("console")
    .start();
//await termination
sq.awaitTermination();
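One thing the console output above makes visible: the flatMap splits every CSV line into fourteen separate one-column rows, so the 14-field schema can never be applied to indivValues in that shape. A hedged intermediate sketch that keeps one row per Kafka record instead (using functions.split, which the solution further down also relies on):

//Keep one row per message and split the CSV once into an array column,
//instead of flatMap-ing each field into its own row.
Dataset<Row> withFields = dsRawData
    .selectExpr("CAST(value AS STRING) AS value")
    .withColumn("fields", functions.split(functions.col("value"), ","));
//withFields now has one row per record with value (string) and fields (array<string>),
//so fields[0], fields[1], ... can be mapped onto the target schema.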
  • I need the data typed to the custom schema shown above, because I will run mathematical calculations over it (for every new row combined with some older ones).
  • Before pushing records onto the topic, would it be better to synthesize headers in the Kafka connector source task? Would having headers make this simpler to solve? (see the sketch after this list)
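On the header question, a hedged note that goes beyond what the question's Spark 2.3.4 setup supports: the Kafka source in Spark 2.x does not expose record headers to the streaming query at all, so adding headers in the connector would not help with parsing the value. Newer Spark releases (3.0 and later) can surface them via the includeHeaders option; the question's config object is reused here purely for illustration.

//Sketch for Spark 3.0+ only (not available in Spark 2.3.4): the Kafka source
//can expose record headers as an extra 'headers' column (array of key/value pairs).
Dataset<Row> withHeaders = sparkSession
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
    .option("subscribe", config.arg1TopicName)
    .option("includeHeaders", "true")
    .load();
//Even then, the CSV payload in 'value' would still need the same split/cast parsing.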

Thank you.

I have now been able to resolve this by using Spark SQL. The solution code is shown below.

//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
                .selectExpr("CAST(value AS STRING)")
                .as(Encoders.STRING());

//create new columns, parse out the orig message and fill column with the values
Dataset<Row> dataAsSchema2 = indivValues
                    .selectExpr("value",
                            "split(value,',')[0] as time",
                            "split(value,',')[1] as cname",
                            "split(value,',')[2] as crole",
                            "split(value,',')[3] as bname",
                            "split(value,',')[4] as stage",
                            "split(value,',')[5] as intid",
                            "split(value,',')[6] as intname",
                            "split(value,',')[7] as intcatid",
                            "split(value,',')[8] as catname",
                            "split(value,',')[9] as are_vval",
                            "split(value,',')[10] as isee_vval",
                            "split(value,',')[11] as opcode",
                            "split(value,',')[12] as optype",
                            "split(value,',')[13] as opname")
                    .drop("value");

//remove any whitespaces as they interfere with data type conversions
dataAsSchema2 = dataAsSchema2
                    .withColumn("intid", functions.regexp_replace(functions.col("int_id"),
                            " ", ""))
                    .withColumn("intcatid", functions.regexp_replace(functions.col("intcatid"),
                            " ", ""))
                    .withColumn("are_vval", functions.regexp_replace(functions.col("are_vval"),
                            " ", ""))
                    .withColumn("isee_vval", functions.regexp_replace(functions.col("isee_vval"),
                            " ", ""))
                    .withColumn("opcode", functions.regexp_replace(functions.col("opcode"),
                            " ", ""));

    //change the column types to get them ready for calculations
dataAsSchema2 = dataAsSchema2
                    .withColumn("intcatid",functions.col("intcatid").cast(DataTypes.IntegerType))
                    .withColumn("intid",functions.col("intid").cast(DataTypes.IntegerType))
                    .withColumn("are_vval",functions.col("are_vval").cast(DataTypes.IntegerType))
                    .withColumn("isee_vval",functions.col("isee_vval").cast(DataTypes.IntegerType))
                    .withColumn("opcode",functions.col("opcode").cast(DataTypes.IntegerType));


//build a POJO dataset
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> pjClass = new Dataset<Pojoclass2>(sparkSession, dataAsSchema2.logicalPlan(), encoder);