
Apache Spark java.lang.ClassCastException when running forEachPartition on a remote master node


I have a Java microservice that connects to an Apache Spark cluster and uses the Datastax Spark Cassandra Connector to persist data to an Apache Cassandra DB cluster.

I wrote the following method to delete data for a specific date range from a Cassandra table.

The code is shown below:

public void deleteData(String fromDate, String toDate) {

    SparkConf conf = sparkSession.sparkContext().getConf();
    CassandraConnector connector = CassandraConnector.apply(conf);

    // Read the Cassandra table and keep only the node ids whose timestamp
    // (the first 10 characters, i.e. the date part) falls within the range.
    Dataset<Row> df = sparkSession.read().format("org.apache.spark.sql.cassandra").options(new HashMap<String, String>() {{
        put("keyspace", CassandraProperties.KEYSPACE);
        put("table", CassandraProperties.ENERGY_FORECASTS);
    }}).load()
            .filter(col("timestamp")
                    .substr(1, 10)
                    .between(fromDate, toDate))
            .select("nodeid");

    // For each partition, open a Cassandra session and delete the matching rows.
    df.foreachPartition(partition -> {
        Session session = connector.openSession();
        while (partition.hasNext()) {
            Row row = partition.next();
            session.execute("DELETE FROM " + CassandraProperties.KEYSPACE + "." + CassandraProperties.ENERGY_FORECASTS + " WHERE nodeid = '" + row.mkString() + "' AND timestamp >= '" + fromDate + "' AND timestamp <= '" + toDate + "'");
        }
        session.close();
    });
}
The code runs fine when executed with a local Spark master (the .master("local[*]") option).
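
For reference, this is roughly how the session is built in the two cases (the application name and master URL below are placeholders, not my real values):

import org.apache.spark.sql.SparkSession;

public class SparkSessions {

    // Local mode: everything runs in the driver JVM, where the application's
    // classes are already on the classpath.
    static SparkSession local() {
        return SparkSession.builder()
                .appName("meter-service")   // placeholder name
                .master("local[*]")
                .getOrCreate();
    }

    // Remote master: the closure passed to foreachPartition is serialized on
    // the driver and deserialized inside executor JVMs on the worker nodes,
    // which also need the application's classes.
    static SparkSession remote() {
        return SparkSession.builder()
                .appName("meter-service")             // placeholder name
                .master("spark://spark-master:7077")  // placeholder master URL
                .getOrCreate();
    }
}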

However, when I try to execute the same code while connected to a remote Spark master node, the following error occurs:

Driver stacktrace:] with root cause
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2.func$4 of type org.apache.spark.api.java.function.ForeachPartitionFunction in instance of org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

[pool-18-thread-1] INFO com.datastax.spark.connector.cql.CassandraConnector - Disconnected from Cassandra cluster: Test Cluster

UPDATE 1

What seems to have done the trick for me was adding the following line to the sparkSession configuration:

.config("spark.jars", "meter-service-1.0.jar")
This appears to supply the missing dependency that was preventing Spark from correctly deserializing the lambda expression on the remote nodes.
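
In SparkSession builder form, the working setup looks roughly like this (again, the app name and master URL are placeholders; only the spark.jars line is the actual change):

import org.apache.spark.sql.SparkSession;

public class SparkSessionWithAppJar {

    static SparkSession build() {
        return SparkSession.builder()
                .appName("meter-service")             // placeholder name
                .master("spark://spark-master:7077")  // placeholder master URL
                // Ship the application's own jar to the executors so the class
                // backing the foreachPartition lambda can be loaded and
                // deserialized on the worker nodes.
                .config("spark.jars", "meter-service-1.0.jar")
                .getOrCreate();
    }
}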


This is explained better here.

My Java is a bit shaky, but could you try extracting the lambda into a method?

public void deleteData(String fromDate, String toDate) {
    SparkConf conf = sparkSession.sparkContext().getConf();
    CassandraConnector connector = CassandraConnector.apply(conf);

    Dataset<Row> df = sparkSession.read().format("org.apache.spark.sql.cassandra").options(new HashMap<String, String>() {{
        put("keyspace", CassandraProperties.KEYSPACE);
        put("table", CassandraProperties.ENERGY_FORECASTS);
    }}).load()
        .filter(col("timestamp")
                .substr(1, 10)
                .between(fromDate, toDate))
        .select("nodeid");


    // Explicit anonymous ForeachPartitionFunction instead of a Java lambda
    df.foreachPartition(new ForeachPartitionFunction<Row>() {
        public void call(Iterator<Row> partition) {
            Session session = connector.openSession();
            while (partition.hasNext()) {
                Row row = partition.next();
                session.execute("DELETE FROM " + CassandraProperties.KEYSPACE + "." + CassandraProperties.ENERGY_FORECASTS + " WHERE nodeid = '" + row.mkString() + "' AND timestamp >= '" + fromDate + "' AND timestamp <= '" + toDate + "'");
            }
            session.close();
        }
    });
}

Thanks for the reply. That was also the first thing I tried, but when I did I ran into a java.io.NotSerializableException.

Does this answer your question?
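
That NotSerializableException typically means the anonymous class is dragging its enclosing (non-serializable) service object into the closure. A minimal sketch of one way around it, with illustrative class and field names (not taken from the service above): move the per-partition logic into a standalone class that implements ForeachPartitionFunction<Row> and holds only the serializable state it needs.

import java.io.Serializable;
import java.util.Iterator;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;

import com.datastax.driver.core.Session;
import com.datastax.spark.connector.cql.CassandraConnector;

// Illustrative sketch: a standalone function class that captures only the
// serializable state it needs, instead of the whole enclosing microservice.
// ForeachPartitionFunction already extends Serializable; implementing it
// explicitly here just makes the requirement visible.
public class DeleteRowsFunction implements ForeachPartitionFunction<Row>, Serializable {

    private final CassandraConnector connector; // designed to be serializable
    private final String keyspace;
    private final String table;
    private final String fromDate;
    private final String toDate;

    public DeleteRowsFunction(CassandraConnector connector, String keyspace,
                              String table, String fromDate, String toDate) {
        this.connector = connector;
        this.keyspace = keyspace;
        this.table = table;
        this.fromDate = fromDate;
        this.toDate = toDate;
    }

    @Override
    public void call(Iterator<Row> partition) {
        // Runs on the executor: open a session, delete this partition's rows, close.
        Session session = connector.openSession();
        while (partition.hasNext()) {
            Row row = partition.next();
            session.execute("DELETE FROM " + keyspace + "." + table
                    + " WHERE nodeid = '" + row.mkString()
                    + "' AND timestamp >= '" + fromDate
                    + "' AND timestamp <= '" + toDate + "'");
        }
        session.close();
    }
}

It would then be called from deleteData as:

    df.foreachPartition(new DeleteRowsFunction(connector, CassandraProperties.KEYSPACE, CassandraProperties.ENERGY_FORECASTS, fromDate, toDate));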