Java - Can't write/save data to Ignite directly from a Spark RDD
I am trying to write a DataFrame to Ignite using JDBC.

Spark version: 2.1
Ignite version: 2.3
JDK: 1.8
Scala: 2.11.8

Here is my code snippet:
def WriteToIgnite(hiveDF: DataFrame, targetTable: String): Unit = {
  val conn = DataSource.conn
  var psmt: PreparedStatement = null
  try {
    OperationIgniteUtil.deleteIgniteData(conn, targetTable)
    hiveDF.foreachPartition({
      partitionOfRecords => {
        partitionOfRecords.foreach(
          row => for (i <- 0 until row.length) {
            psmt = OperationIgniteUtil.getInsertStatement(conn, targetTable, hiveDF.schema)
            psmt.setObject(i + 1, row.get(i))
            psmt.execute()
          }
        )
      }
    })
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    conn.close
  }
}
Then I run it on Spark, and it prints this error message:
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2305)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2305)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2305)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
    at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2304)
    at com.pingan.pilot.ignite.common.OperationIgniteUtil$.WriteToIgnite(OperationIgniteUtil.scala:72)
    at com.pingan.pilot.ignite.etl.HdfsToIgnite$.main(HdfsToIgnite.scala:36)
    at com.pingan.pilot.ignite.etl.HdfsToIgnite.main(HdfsToIgnite.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.ignite.internal.jdbc2.JdbcConnection
Serialization stack:
    - object not serializable (class: org.apache.ignite.internal.jdbc2.JdbcConnection, value: org.apache.ignite.internal.jdbc2.JdbcConnection@7ebc2975)
    - field (class: com.pingan.pilot.ignite.common.OperationIgniteUtil$$anonfun$WriteToIgnite$1, name: conn$1, type: interface java.sql.Connection)
    - object (class com.pingan.pilot.ignite.common.OperationIgniteUtil$$anonfun$WriteToIgnite$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 27 more
Does anyone know how I can fix it? Thanks!

You have to extend the Serializable interface:
object Test extends Serializable {
  def WriteToIgnite(hiveDF: DataFrame, targetTable: String): Unit = {
    ???
  }
}
I hope it solves your problem.

The problem here is that you cannot serialize the connection to Ignite, DataSource.conn. The closure you pass to foreachPartition contains the connection as part of its scope, which is why Spark cannot serialize it.
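As an illustration only (this is not part of the original answer): a common workaround for this class of error is to open the connection inside foreachPartition, so that each executor creates its own connection and nothing non-serializable is captured by the closure. In this sketch the JDBC URL is a placeholder, OperationIgniteUtil.getInsertStatement is the helper from the question, and the statement is created once per partition instead of once per column (which looks like an additional bug in the original snippet):

import java.sql.DriverManager

def writeToIgnitePerPartition(hiveDF: DataFrame, targetTable: String): Unit = {
  val schema = hiveDF.schema // capture the schema, not the whole DataFrame
  val url = "jdbc:ignite:cfg://file:///path/to/ignite-jdbc.xml" // placeholder URL
  hiveDF.foreachPartition { partitionOfRecords =>
    val conn = DriverManager.getConnection(url) // opened on the executor
    try {
      val psmt = OperationIgniteUtil.getInsertStatement(conn, targetTable, schema)
      partitionOfRecords.foreach { row =>
        for (i <- 0 until row.length) psmt.setObject(i + 1, row.get(i))
        psmt.addBatch() // queue one INSERT per row
      }
      psmt.executeBatch()
    } finally {
      conn.close()
    }
  }
}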
Fortunately, Ignite provides a custom implementation of RDD that allows you to save values to it. You first need to create an IgniteContext, and then retrieve Ignite's shared RDD, which provides distributed access to Ignite, to save the Rows of your RDD:
val igniteContext = new IgniteContext(sparkContext, () => new IgniteConfiguration())
...
// Retrieve Ignite's shared RDD, typed here to hold Spark Rows
val igniteRdd = igniteContext.fromCache[Any, Row]("partitioned")
igniteRdd.saveValues(hiveDF.rdd)
For more information, see the Apache Ignite documentation. This should help.
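If you need explicit cache keys instead of auto-generated ones, the same shared RDD also exposes savePairs. A minimal sketch, continuing the snippet above and assuming (for illustration only) that the first column of each row can serve as the key:

// Hypothetical choice of key: the first column of each row
igniteRdd.savePairs(hiveDF.rdd.map(row => (row.get(0), row)))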