Apache Spark Structured Streaming: streaming exception on restart due to checkpoint data

Tags: apache-spark, spark-streaming, apache-spark-2.0

I have a read stream that consumes data from a Kafka topic. Based on the value of an attribute in each incoming message, I have to write the data to one of two different locations in S3 (if the value is value1, write to location1; otherwise write to location2).
Below is the high-level approach I am taking:

Dataset<Row> kafkaStreamSet = sparkSession 
            .readStream() 
            .format("kafka") 
            .option("kafka.bootstrap.servers", kafkaBootstrap) 
            .option("subscribe", kafkaTopic) 
            .option("startingOffsets", "latest") 
            .option("failOnDataLoss", false) 
            .option("maxOffsetsPerTrigger", offsetsPerTrigger) 
            .load(); 

    //raw message to ClickStream 
    Dataset<ClickStream> ds1 = kafkaStreamSet.mapPartitions(processClickStreamMessages, Encoders.bean(ClickStream.class));

    // ... ds1 is split into ds2 (booking requests) and ds3 (page views)
    // based on the attribute value; the splitting code is omitted here ...

    StreamingQuery bookingRequestsParquetStreamWriter = ds2.writeStream().outputMode("append") 
        .format("parquet") 
        .trigger(ProcessingTime.create(bookingRequestProcessingTime, TimeUnit.MILLISECONDS)) 
        .option("checkpointLocation",  "s3://" + s3Bucket+ "/checkpoint/bookingRequests") 
        .partitionBy("eventDate") 
        .start("s3://" + s3Bucket+ "/" +  bookingRequestPath); 

    StreamingQuery PageViewsParquetStreamWriter = ds3.writeStream().outputMode("append") 
        .format("parquet") 
        .trigger(ProcessingTime.create(pageViewProcessingTime, TimeUnit.MILLISECONDS)) 
        .option("checkpointLocation",  "s3://" + s3Bucket+ "/checkpoint/PageViews") 
        .partitionBy("eventDate") 
        .start("s3://" + s3Bucket+ "/" +  pageViewPath); 

    bookingRequestsParquetStreamWriter.awaitTermination(); 
    PageViewsParquetStreamWriter.awaitTermination(); 
It seems to work fine: when I deploy the application I can see data being written to the different paths. However, whenever the job is restarted, either after a failure or after being stopped and started manually, it keeps failing with the exception below (where userSessionEventJoin.global is my topic name):

Caused by: org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"userSessionEventJoin.global":{"92":154362528,"101 ...
    at org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
    at org.apache.spark.sql.kafka010.KafkaSourceOffset$.apply(KafkaSourceOffset.scala:59)

If I delete all the checkpoint information, the job starts again and creates fresh checkpoints in the two given locations, but that means I have to start processing from the latest offsets again and I lose all the previously processed offsets. The Spark version is 2.1 and the topic has more than 100 partitions.
I also tested with only one writeStream (a single checkpoint location); the same exception occurs on restart.


Please suggest a solution. Thanks.

Your code looks like a straightforward click-stream job: in your example you create a Spark streaming session and use a checkpoint directory to store checkpoint data intermittently.

However, your code does not know how to recover from that checkpoint.

By the end of this answer it will be clear why.

Here are the steps for a production-grade streaming job:

1) Use the getOrCreate API to create your Spark streaming session.
   a) getOrCreate takes two parameters: a factory function that creates the streaming context/session, and the checkpoint directory.
2) When the program starts for the first time, it uses the checkpoint directory to store its internal details (among other things).
3) When the program crashes, or is stopped and restarted, the streaming session is re-created from that checkpoint, which gives you exactly the recovery behaviour you want. A minimal sketch follows, and a full example comes after it.
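
A minimal sketch of steps 1-3, assuming the usual Spark Streaming imports, an existing checkpointDirectory string, and a one-second batch interval (the app name and variable names are placeholders for illustration; the complete, runnable version is the example further below):

    // Factory that builds the context only the *first* time the job runs; on a restart,
    // getOrCreate skips this function and rebuilds the context from the checkpoint data.
    Function0<JavaStreamingContext> createContextFunc = () -> {
        SparkConf conf = new SparkConf().setAppName("MyStreamingJob");    // placeholder app name
        JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(1));
        context.checkpoint(checkpointDirectory);   // where the recovery metadata is written
        // ... define the DStreams, transformations and output operations here ...
        return context;
    };

    // Step 1: recover from the checkpoint directory if it exists, otherwise create a fresh context.
    JavaStreamingContext ssc =
        JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc);

    ssc.start();            // steps 2 and 3 then happen automatically
    ssc.awaitTermination();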
Since link-only answers are discouraged on Stack Overflow, I'll give the sample code below; it is essentially taken from the Spark streaming examples (JavaRecoverableNetworkWordCount):

 /**
            * Counts words in text encoded with UTF8 received from the network every second. This example also
            * shows how to use lazily instantiated singleton instances for Accumulator and Broadcast so that
            * they can be registered on driver failures.
            *
            * Usage: JavaRecoverableNetworkWordCount <hostname> <port> <checkpoint-directory> <output-file>
            *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
            *   data. <checkpoint-directory> directory to HDFS-compatible file system which checkpoint data
            *   <output-file> file to which the word counts will be appended
            *
            * <checkpoint-directory> and <output-file> must be absolute paths
            *
            * To run this on your local machine, you need to first run a Netcat server
            *
            *      `$ nc -lk 9999`
            *
            * and run the example as
            *
            *      `$ ./bin/run-example org.apache.spark.examples.streaming.JavaRecoverableNetworkWordCount \
            *              localhost 9999 ~/checkpoint/ ~/out`
            *
            * If the directory ~/checkpoint/ does not exist (e.g. running for the first time), it will create
            * a new StreamingContext (will print "Creating new context" to the console). Otherwise, if
            * checkpoint data exists in ~/checkpoint/, then it will create StreamingContext from
            * the checkpoint data.
            *
            * Refer to the online documentation for more details.
            */
            public final class JavaRecoverableNetworkWordCount {
            private static final Pattern SPACE = Pattern.compile(" ");

            private static JavaStreamingContext createContext(String ip,
                                                                int port,
                                                                String checkpointDirectory,
                                                                String outputPath) {

                // If you do not see this printed, that means the StreamingContext has been loaded
                // from the new checkpoint
                System.out.println("Creating new context");
                File outputFile = new File(outputPath);
                if (outputFile.exists()) {
                outputFile.delete();
                }
                SparkConf sparkConf = new SparkConf().setAppName("JavaRecoverableNetworkWordCount");
                // Create the context with a 1 second batch size
                JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
                ssc.checkpoint(checkpointDirectory);

                // Create a socket stream on target ip:port and count the
                // words in input stream of \n delimited text (eg. generated by 'nc')
                JavaReceiverInputDStream<String> lines = ssc.socketTextStream(ip, port);
                JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
                JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
                    .reduceByKey((i1, i2) -> i1 + i2);

                wordCounts.foreachRDD((rdd, time) -> {
                // Get or register the blacklist Broadcast
                Broadcast<List<String>> blacklist =
                    JavaWordBlacklist.getInstance(new JavaSparkContext(rdd.context()));
                // Get or register the droppedWordsCounter Accumulator
                LongAccumulator droppedWordsCounter =
                    JavaDroppedWordsCounter.getInstance(new JavaSparkContext(rdd.context()));
                // Use blacklist to drop words and use droppedWordsCounter to count them
                String counts = rdd.filter(wordCount -> {
                    if (blacklist.value().contains(wordCount._1())) {
                    droppedWordsCounter.add(wordCount._2());
                    return false;
                    } else {
                    return true;
                    }
                }).collect().toString();
                String output = "Counts at time " + time + " " + counts;
                System.out.println(output);
                System.out.println("Dropped " + droppedWordsCounter.value() + " word(s) totally");
                System.out.println("Appending to " + outputFile.getAbsolutePath());
                Files.append(output + "\n", outputFile, Charset.defaultCharset());
                });

                return ssc;
            }

            public static void main(String[] args) throws Exception {
                if (args.length != 4) {
                System.err.println("You arguments were " + Arrays.asList(args));
                System.err.println(
                    "Usage: JavaRecoverableNetworkWordCount <hostname> <port> <checkpoint-directory>\n" +
                    "     <output-file>. <hostname> and <port> describe the TCP server that Spark\n" +
                    "     Streaming would connect to receive data. <checkpoint-directory> directory to\n" +
                    "     HDFS-compatible file system which checkpoint data <output-file> file to which\n" +
                    "     the word counts will be appended\n" +
                    "\n" +
                    "In local mode, <master> should be 'local[n]' with n > 1\n" +
                    "Both <checkpoint-directory> and <output-file> must be absolute paths");
                System.exit(1);
                }

                String ip = args[0];
                int port = Integer.parseInt(args[1]);
                String checkpointDirectory = args[2];
                String outputPath = args[3];

                // Function to create JavaStreamingContext without any output operations
                // (used to detect the new context)
                Function0<JavaStreamingContext> createContextFunc =
                    () -> createContext(ip, port, checkpointDirectory, outputPath);

                JavaStreamingContext ssc =
                JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc);
                ssc.start();
                ssc.awaitTermination();
            }
            }
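
The key part for your situation is main(): because the context is obtained through JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc), a restart after a crash or a manual stop rebuilds the streaming context from the data in the checkpoint directory instead of calling createContext again, so processing resumes from the stored state rather than failing on startup.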