Apache Spark Streaming error: Attempted to use BlockRDD after its blocks have been removed
Tags: apache-spark, apache-spark-sql

I am trying to run an Apache Spark Streaming program that receives a stream of data, does some processing, saves the result in cache, and compares it with the data from the next stream. My program runs the first batch fine, then exits on the next batch with the following error:

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenExchange hashpartitioning(
         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
         at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage
 failure: Task creation failed: org.apache.spark.SparkException:
 Attempted to use BlockRDD[1] at socketTextStream at
 StreamExample.java:56 after its blocks have been removed!
 org.apache.spark.rdd.BlockRDD.assertValid(BlockRDD.scala:83)
 org.apache.spark.rdd.BlockRDD.getPreferredLocations(BlockRDD.scala:56)
 org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
 org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
 scala.Option.getOrElse(Option.scala:120)
 org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1556)
I am trying to join the streaming data to the existing cached data, perform some operations using join statements, and then cache the result again:

static DataFrame active = null;

stream.foreachRDD(rdd -> {                     // stream: the DStream from socketTextStream
    DataFrame x = sqlContext.read().json(rdd);
    if (active == null) {
        active = x;
    } else {
        DataFrame f = active.join(x);
        // ... other operations using join statements ...
        active = f;
        active.persist();
    }
});
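
For what it's worth, the usual cause of this error is that persist() is lazy: nothing is materialized until an action runs, so when the next batch triggers the join, Spark tries to recompute the cached DataFrame from the input BlockRDD, whose blocks Spark Streaming has already deleted. Below is a minimal sketch of one workaround, reusing the same stream/sqlContext/active names as the snippet above: force materialization with an action inside the same batch.

    stream.foreachRDD(rdd -> {
        DataFrame x = sqlContext.read().json(rdd);
        if (active == null) {
            active = x.persist();
            active.count();             // action: materializes the cache while the
                                        // batch's input blocks still exist
        } else {
            DataFrame joined = active.join(x).persist();
            joined.count();             // materialize before the blocks are dropped
            active.unpersist();         // release the previous batch's cache
            active = joined;
        }
    });

Once count() has run, the cached data no longer has to be recomputed from the deleted BlockRDD.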

Could you post a code snippet? It is not clear whether you are comparing two streams or data from different batches of the same stream. (If the latter, use streamingContext.remember(duration) to persist data across batches, or use a window operation?) What are you actually trying to do? It looks like you are joining across RDDs within a single DStream, which is really a stateful/windowed operation.

Yes, I think this is a window operation, and I could implement it with a window. But I want a continuously running online application, not one bound to a specified duration.
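
As a rough sketch of the windowed alternative mentioned above (the host, port, and durations here are made up for illustration), a window spanning two batch intervals keeps the previous batch's data alive, so consecutive batches can be compared without hitting the BlockRDD error:

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

    // Window covering the last two 10-second batches, sliding one batch at a time.
    JavaDStream<String> lastTwoBatches =
        lines.window(Durations.seconds(20), Durations.seconds(10));

    lastTwoBatches.foreachRDD(rdd -> {
        DataFrame both = sqlContext.read().json(rdd);
        // join/compare the combined two-batch DataFrame here
    });

    // Alternatively, keep input RDDs around longer across batches:
    // ssc.remember(Durations.minutes(5));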