Apache Spark: exception when reading from Kafka in Spark 2.1.1 while using selectExpr
I am running the default example provided by Spark to count words from a Kafka stream. Below is the code I am running:
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.Map;
import java.util.HashMap;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
public class JavaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession
            .builder()
            .appName("JavaWordCount")
            .getOrCreate();

        Dataset<Row> lines = spark
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "Altopic")
            .option("startingOffsets", "latest")
            .load();

        lines.selectExpr("CAST key AS STRING", "CAST value AS STRING");

        Dataset<String> words = lines
            .as(Encoders.STRING())
            .flatMap(
                new FlatMapFunction<String, String>() {
                    @Override
                    public Iterator<String> call(String x) {
                        return Arrays.asList(x.split(" ")).iterator();
                    }
                }, Encoders.STRING());

        Dataset<Row> wordCounts = words.groupBy("value").count();

        StreamingQuery query = wordCounts.writeStream()
            .outputMode("complete")
            .format("console")
            .start();

        query.awaitTermination();
    }
}
In the pom.xml file I added the following dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.1</version>
</dependency>
At runtime I hit the following exception on this line:
lines.selectExpr("CAST key AS STRING","CAST value AS STRING");
Exception:
Try to map struct<key:binary,value:binary,topic:string,partition:int,offset:bigint,timestamp:timestamp,timestampType:int> to Tuple1, but failed as the number of fields does not line up.
Please help me resolve this exception. Thank you all!

The problem is in this line:
lines.as(Encoders.STRING())
You can change
lines.selectExpr("CAST key AS STRING", "CAST value AS STRING");
Dataset<String> words = lines
.as(Encoders.STRING())
to
Dataset<String> words = lines.selectExpr("CAST value AS STRING")
    .as(Encoders.STRING())

You need to use the return value of lines.selectExpr; this method does not change lines itself. Since you are using .as(Encoders.STRING()), I think you only need the value column.
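The misunderstanding behind the fix is not Spark-specific: selectExpr returns a new Dataset rather than modifying the one it is called on, so a call whose return value is discarded has no effect. A minimal plain-Java sketch of the same pattern (using String.toUpperCase as a stand-in, since String is likewise immutable; no Spark required):

```java
public class ReturnValueDemo {
    public static void main(String[] args) {
        String line = "spark";

        // Discarding the return value leaves the original untouched,
        // just like calling lines.selectExpr(...) on its own line.
        line.toUpperCase();
        System.out.println(line);   // prints "spark"

        // Assigning the return value is what applies the transformation,
        // analogous to Dataset<String> words = lines.selectExpr(...).as(...).
        String upper = line.toUpperCase();
        System.out.println(upper);  // prints "SPARK"
    }
}
```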