Apache Spark Structured Streaming: streaming exception on restart due to checkpoint data

Tags: apache-spark, spark-streaming, apache-spark-2.0

I have a read stream that consumes data from a Kafka topic. Based on the value of an attribute in each incoming message, I have to write the data to one of two different locations in S3 (if the value is value1, write to location1; otherwise write to location2).
Below is the high-level approach I am taking:

Dataset<Row> kafkaStreamSet = sparkSession 
            .readStream() 
            .format("kafka") 
            .option("kafka.bootstrap.servers", kafkaBootstrap) 
            .option("subscribe", kafkaTopic) 
            .option("startingOffsets", "latest") 
            .option("failOnDataLoss", false) 
            .option("maxOffsetsPerTrigger", offsetsPerTrigger) 
            .load(); 

    //raw message to ClickStream 
    Dataset<ClickStream> ds1 = kafkaStreamSet.mapPartitions(processClickStreamMessages, Encoders.bean(ClickStream.class));

    // ... ds1 is split into ds2 (booking requests) and ds3 (page views)
    // based on the attribute value; the splitting code is omitted here ...

    StreamingQuery bookingRequestsParquetStreamWriter = ds2.writeStream().outputMode("append") 
        .format("parquet") 
        .trigger(ProcessingTime.create(bookingRequestProcessingTime, TimeUnit.MILLISECONDS)) 
        .option("checkpointLocation",  "s3://" + s3Bucket+ "/checkpoint/bookingRequests") 
        .partitionBy("eventDate") 
        .start("s3://" + s3Bucket+ "/" +  bookingRequestPath); 

    StreamingQuery PageViewsParquetStreamWriter = ds3.writeStream().outputMode("append") 
        .format("parquet") 
        .trigger(ProcessingTime.create(pageViewProcessingTime, TimeUnit.MILLISECONDS)) 
        .option("checkpointLocation",  "s3://" + s3Bucket+ "/checkpoint/PageViews") 
        .partitionBy("eventDate") 
        .start("s3://" + s3Bucket+ "/" +  pageViewPath); 

    bookingRequestsParquetStreamWriter.awaitTermination(); 
    PageViewsParquetStreamWriter.awaitTermination(); 
It seems to work fine: when I deploy the application I can see data being written to the different paths. However, whenever the job is restarted, either after a failure or after being stopped and started manually, it keeps failing with the exception below (where userSessionEventJoin.global is my topic name):

Caused by: org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"userSessionEventJoin.global":{"92":154362528,"101 ...
    at org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
    at org.apache.spark.sql.kafka010.KafkaSourceOffset$.apply(KafkaSourceOffset.scala:59)

If I delete all the checkpoint information, the job starts again and creates fresh checkpoints in the two given locations, but that means I have to start processing from the latest offsets again and I lose all the previously processed offsets. The Spark version is 2.1 and the topic has more than 100 partitions.
I also tested with only one writeStream (a single checkpoint location); the same exception occurs on restart.


Please suggest a solution. Thanks.

Your code looks like a straightforward click-stream job: in your example you create a Spark streaming session and use a checkpoint directory to store checkpoint data intermittently.

However, your code does not know how to recover from that checkpoint.

By the end of this answer it will be clear why.

Here are the steps for a production-grade streaming job:

1) Use the getOrCreate API to create your Spark streaming session.
   a) getOrCreate takes two parameters: a factory function that creates the streaming context/session, and the checkpoint directory.
2) When the program starts for the first time, it uses the checkpoint directory to store its internal details (among other things).
3) When the program crashes, or is stopped and restarted, the streaming session is re-created from that checkpoint, which gives you exactly the recovery behaviour you want. A minimal sketch follows, and a full example comes after it.
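
A minimal sketch of steps 1-3, assuming the usual Spark Streaming imports, an existing checkpointDirectory string, and a one-second batch interval (the app name and variable names are placeholders for illustration; the complete, runnable version is the example further below):

    // Factory that builds the context only the *first* time the job runs; on a restart,
    // getOrCreate skips this function and rebuilds the context from the checkpoint data.
    Function0<JavaStreamingContext> createContextFunc = () -> {
        SparkConf conf = new SparkConf().setAppName("MyStreamingJob");    // placeholder app name
        JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(1));
        context.checkpoint(checkpointDirectory);   // where the recovery metadata is written
        // ... define the DStreams, transformations and output operations here ...
        return context;
    };

    // Step 1: recover from the checkpoint directory if it exists, otherwise create a fresh context.
    JavaStreamingContext ssc =
        JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc);

    ssc.start();            // steps 2 and 3 then happen automatically
    ssc.awaitTermination();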
Since link-only answers are discouraged on Stack Overflow, I'll give the sample code below; it is essentially taken from the Spark streaming examples (JavaRecoverableNetworkWordCount):

 /**
            * Counts words in text encoded with UTF8 received from the network every second. This example also
            * shows how to use lazily instantiated singleton instances for Accumulator and Broadcast so that
            * they can be registered on driver failures.
            *
            * Usage: JavaRecoverableNetworkWordCount <hostname> <port> <checkpoint-directory> <output-file>
            *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
            *   data. <checkpoint-directory> directory to HDFS-compatible file system which checkpoint data
            *   <output-file> file to which the word counts will be appended
            *
            * <checkpoint-directory> and <output-file> must be absolute paths
            *
            * To run this on your local machine, you need to first run a Netcat server
            *
            *      `$ nc -lk 9999`
            *
            * and run the example as
            *
            *      `$ ./bin/run-example org.apache.spark.examples.streaming.JavaRecoverableNetworkWordCount \
            *              localhost 9999 ~/checkpoint/ ~/out`
            *
            * If the directory ~/checkpoint/ does not exist (e.g. running for the first time), it will create
            * a new StreamingContext (will print "Creating new context" to the console). Otherwise, if
            * checkpoint data exists in ~/checkpoint/, then it will create StreamingContext from
            * the checkpoint data.
            *
            * Refer to the online documentation for more details.
            */
            public final class JavaRecoverableNetworkWordCount {
            private static final Pattern SPACE = Pattern.compile(" ");

            private static JavaStreamingContext createContext(String ip,
                                                                int port,
                                                                String checkpointDirectory,
                                                                String outputPath) {

                // If you do not see this printed, that means the StreamingContext has been loaded
                // from the new checkpoint
                System.out.println("Creating new context");
                File outputFile = new File(outputPath);
                if (outputFile.exists()) {
                outputFile.delete();
                }
                SparkConf sparkConf = new SparkConf().setAppName("JavaRecoverableNetworkWordCount");
                // Create the context with a 1 second batch size
                JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
                ssc.checkpoint(checkpointDirectory);

                // Create a socket stream on target ip:port and count the
                // words in input stream of \n delimited text (eg. generated by 'nc')
                JavaReceiverInputDStream<String> lines = ssc.socketTextStream(ip, port);
                JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
                JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
                    .reduceByKey((i1, i2) -> i1 + i2);

                wordCounts.foreachRDD((rdd, time) -> {
                // Get or register the blacklist Broadcast
                Broadcast<List<String>> blacklist =
                    JavaWordBlacklist.getInstance(new JavaSparkContext(rdd.context()));
                // Get or register the droppedWordsCounter Accumulator
                LongAccumulator droppedWordsCounter =
                    JavaDroppedWordsCounter.getInstance(new JavaSparkContext(rdd.context()));
                // Use blacklist to drop words and use droppedWordsCounter to count them
                String counts = rdd.filter(wordCount -> {
                    if (blacklist.value().contains(wordCount._1())) {
                    droppedWordsCounter.add(wordCount._2());
                    return false;
                    } else {
                    return true;
                    }
                }).collect().toString();
                String output = "Counts at time " + time + " " + counts;
                System.out.println(output);
                System.out.println("Dropped " + droppedWordsCounter.value() + " word(s) totally");
                System.out.println("Appending to " + outputFile.getAbsolutePath());
                Files.append(output + "\n", outputFile, Charset.defaultCharset());
                });

                return ssc;
            }

            public static void main(String[] args) throws Exception {
                if (args.length != 4) {
                System.err.println("You arguments were " + Arrays.asList(args));
                System.err.println(
                    "Usage: JavaRecoverableNetworkWordCount <hostname> <port> <checkpoint-directory>\n" +
                    "     <output-file>. <hostname> and <port> describe the TCP server that Spark\n" +
                    "     Streaming would connect to receive data. <checkpoint-directory> directory to\n" +
                    "     HDFS-compatible file system which checkpoint data <output-file> file to which\n" +
                    "     the word counts will be appended\n" +
                    "\n" +
                    "In local mode, <master> should be 'local[n]' with n > 1\n" +
                    "Both <checkpoint-directory> and <output-file> must be absolute paths");
                System.exit(1);
                }

                String ip = args[0];
                int port = Integer.parseInt(args[1]);
                String checkpointDirectory = args[2];
                String outputPath = args[3];

                // Function to create JavaStreamingContext without any output operations
                // (used to detect the new context)
                Function0<JavaStreamingContext> createContextFunc =
                    () -> createContext(ip, port, checkpointDirectory, outputPath);

                JavaStreamingContext ssc =
                JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc);
                ssc.start();
                ssc.awaitTermination();
            }
            }
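
The key part for your situation is main(): because the context is obtained through JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc), a restart after a crash or a manual stop rebuilds the streaming context from the data in the checkpoint directory instead of calling createContext again, so processing resumes from the stored state rather than failing on startup.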