Python Spark Streaming updateStateByKey fails

python apache-spark pyspark

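I am working on a task where I need to keep a running total of data across batches in a Python Spark Streaming job. I am using updateStateByKey (near the end of the code below):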

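    import sys
    import os

    from pyspark import SparkContext, SparkConf
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    if __name__ == "__main__":

        # Create the Spark context
        sc = SparkContext(appName="PythonStreamingDirectKafkaCount")
        ssc = StreamingContext(sc, 1)

        # Checkpoint for backups
        ssc.checkpoint("file:///tmp/spark")

        brokers, topic = sys.argv[1:]

        print(brokers)
        print(topic)

        sc.setLogLevel("WARN")

        # Connect to Kafka
        kafkaParams = {"metadata.broker.list": brokers}
        kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams)

        def parse_log_line(line):
            (uuid, timestamp, url, user, region, browser, platform, cd, ttf) = line.strip().split(",")
            hour = timestamp[0:13]
            return (url, 1)

        # Each Kafka record is a (key, value) pair; keep only the value
        lines = kafkaStream.map(lambda x: x[1])
        parsed_lines = lines.map(parse_log_line)

        # Per-batch click count for each URL
        clicks = parsed_lines.reduceByKey(lambda a, b: a + b)
        clicks.pprint()

        def countKeys(newValues, lastSum):
            if lastSum is None:
                lastSum = 0
            return sum(newValues, lastSum)

        # The problem is here
        sum_clicks = clicks.updateStateByKey(countKeys)
        # I tried this, but it made no difference
        # sum_clicks = clicks.updateStateByKey(countKeys, numPartitions=2)
        sum_clicks.pprint()

        ssc.start()
        ssc.awaitTermination()
        ssc.stop()

The relevant part of the error message appears when pprint() is called, but I think that is only because that call triggers the computation. The error is: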

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Checkpoint RDD has a different number of partitions from original RDD. Original RDD [ID: 195, num of partitions: 2]; Checkpoint RDD [ID: 267, num of partitions: 0].

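It says that the number of partitions in the original RDD and the checkpoint RDD is different, but I tried specifying numPartitions=2 and it made no difference.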

Does anyone know what I am doing wrong? Thanks!
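For reference, here is what I expect the update function to do, as a minimal plain-Python sketch that runs without Spark. The helper `simulate_update_state_by_key` and the sample batches are made up for illustration and are not part of the Spark API. The sketch assumes the update function is called once per key per batch with the list of new values and the previous state (or None), including for keys that have state but no new values.

```python
from collections import defaultdict


def count_keys(new_values, last_sum):
    # Same logic as countKeys in the question: add new values to the running total
    if last_sum is None:
        last_sum = 0
    return sum(new_values, last_sum)


def simulate_update_state_by_key(batches, update_func):
    """Hypothetical helper (not Spark): apply update_func per key, batch by batch."""
    state = {}
    history = []
    for batch in batches:
        # Group this batch's values by key
        grouped = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        # Visit keys with new values and keys that only have earlier state
        for key in set(state) | set(grouped):
            new_state = update_func(grouped.get(key, []), state.get(key))
            if new_state is None:
                # Returning None drops the key from the state
                state.pop(key, None)
            else:
                state[key] = new_state
        history.append(dict(state))
    return history


# Made-up per-batch (url, count) pairs, shaped like the output of reduceByKey
batches = [
    [("/a", 2), ("/b", 1)],
    [("/a", 3)],
    [("/b", 4), ("/c", 1)],
]
history = simulate_update_state_by_key(batches, count_keys)
for totals in history:
    print(sorted(totals.items()))
```

So the update function itself appears to behave as intended in isolation; the exception above is raised in the checkpointing step rather than inside countKeys.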