Apache Spark java.lang.ClassCastException when running foreachPartition on a remote master

I have a Java microservice that connects to an Apache Spark cluster and uses the Datastax Spark Cassandra Connector to persist data to a Cassandra DB cluster. I wrote the following method to delete data for a specific date range from a Cassandra table. The code is shown below:
public void deleteData(String fromDate, String toDate) {
    SparkConf conf = sparkSession.sparkContext().getConf();
    CassandraConnector connector = CassandraConnector.apply(conf);

    // Read only the node ids whose timestamp (first 10 characters, i.e. the
    // date part) falls inside the requested range.
    Dataset<Row> df = sparkSession.read()
            .format("org.apache.spark.sql.cassandra")
            .options(new HashMap<String, String>() {{
                put("keyspace", CassandraProperties.KEYSPACE);
                put("table", CassandraProperties.ENERGY_FORECASTS);
            }})
            .load()
            .filter(col("timestamp")
                    .substr(1, 10)
                    .between(fromDate, toDate))
            .select("nodeid");

    // Delete the matching rows, opening one Cassandra session per partition.
    df.foreachPartition(partition -> {
        Session session = connector.openSession();
        while (partition.hasNext()) {
            Row row = partition.next();
            session.execute("DELETE FROM " + CassandraProperties.KEYSPACE + "." + CassandraProperties.ENERGY_FORECASTS
                    + " WHERE nodeid = '" + row.mkString()
                    + "' AND timestamp >= '" + fromDate
                    + "' AND timestamp <= '" + toDate + "'");
        }
        session.close();
    });
}
The code runs fine when using a local Spark master (the .master("local[*]") option). However, when I try to execute the same code while connected to a remote Spark master, the following error occurs:
Driver stacktrace: the root cause is java.lang.ClassCastException:

java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2.func$4 of type org.apache.spark.api.java.function.ForeachPartitionFunction in instance of org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

[pool-18-thread-1] INFO com.datastax.spark.connector.cql.CassandraConnector - Disconnected from Cassandra cluster: Test Cluster
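The failing step is plain Java deserialization of the lambda on an executor. The same mechanism can be exercised outside Spark with an intersection cast; this is a minimal stdlib-only sketch (the class name is illustrative), showing that a lambda travels over the wire as java.lang.invoke.SerializedLambda, the class named in the trace above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class LambdaSerDemo {

    // A lambda is only serializable if its target type is Serializable,
    // which is what Spark relies on when shipping foreachPartition closures.
    public static byte[] serialize(Object o) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static Object deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The intersection cast makes this lambda serializable; on the wire
        // it is represented by java.lang.invoke.SerializedLambda.
        Runnable task = (Runnable & Serializable) () -> System.out.println("ran");
        Runnable copy = (Runnable) deserialize(serialize(task));
        copy.run();
        // Deserialization succeeds here only because this JVM has the
        // capturing class (LambdaSerDemo) on its classpath; an executor
        // missing the application jar cannot resolve it and fails much
        // like the stack trace above.
    }
}
```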
UPDATE 1

It turns out what did the trick for me was adding the following line to the sparkSession configuration:

.config("spark.jars", "meter-service-1.0.jar")

This appears to provide the missing dependencies that were preventing Spark from correctly deserializing the lambda expression on the remote nodes.
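For context, a sketch of where that setting would go in the session builder (the app name, master URL, and jar path below are illustrative, not from the original project):

    SparkSession sparkSession = SparkSession.builder()
            .appName("meter-service")
            .master("spark://spark-master:7077")  // remote master (illustrative URL)
            // Ship the application jar so executors can resolve the classes
            // backing the serialized lambda.
            .config("spark.jars", "meter-service-1.0.jar")
            .getOrCreate();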
This is explained better elsewhere. My Java is rusty, but could you try extracting the lambda into a method?
public void deleteData(String fromDate, String toDate) {
    SparkConf conf = sparkSession.sparkContext().getConf();
    CassandraConnector connector = CassandraConnector.apply(conf);
    Dataset<Row> df = sparkSession.read()
            .format("org.apache.spark.sql.cassandra")
            .options(new HashMap<String, String>() {{
                put("keyspace", CassandraProperties.KEYSPACE);
                put("table", CassandraProperties.ENERGY_FORECASTS);
            }})
            .load()
            .filter(col("timestamp")
                    .substr(1, 10)
                    .between(fromDate, toDate))
            .select("nodeid");

    df.foreachPartition(new ForeachPartitionFunction<Row>() {
        public void call(Iterator<Row> partition) {
            Session session = connector.openSession();
            while (partition.hasNext()) {
                Row row = partition.next();
                session.execute("DELETE FROM " + CassandraProperties.KEYSPACE + "." + CassandraProperties.ENERGY_FORECASTS
                        + " WHERE nodeid = '" + row.mkString()
                        + "' AND timestamp >= '" + fromDate
                        + "' AND timestamp <= '" + toDate + "'");
            }
            session.close();
        }
    });
}
Thanks for the reply. That was also my first attempt, but when I tried it I ran into java.io.NotSerializableException. Does this answer your question?
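A common cause of that NotSerializableException is the anonymous class capturing a hidden reference to its enclosing (non-serializable) instance; a static nested class avoids the capture. A Spark-free sketch of the difference, using plain java.io serialization (class names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Consumer;

public class CaptureDemo {
    private final String state = "outer";  // makes the outer instance non-trivial

    // Non-static inner class: serializing it drags in the enclosing
    // CaptureDemo instance, which is not Serializable, so it fails.
    class Inner implements Serializable, Consumer<String> {
        public void accept(String s) { }
    }

    // Static nested class: no hidden reference to the outer instance,
    // so it serializes cleanly.
    static class Nested implements Serializable, Consumer<String> {
        public void accept(String s) { }
    }

    // Returns true if the object survives Java serialization.
    static boolean serializes(Object o) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (IOException e) {  // NotSerializableException is an IOException
            return false;
        }
    }

    public static void main(String[] args) {
        CaptureDemo outer = new CaptureDemo();
        System.out.println(serializes(outer.new Inner()));  // false: captures outer
        System.out.println(serializes(new Nested()));       // true
    }
}
```

The same reasoning applies to the anonymous ForeachPartitionFunction above: if it is declared inside a non-serializable service class, Spark tries to serialize that enclosing instance along with it.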