Persisting a Spark DataFrame to Apache Ignite


I want to persist a Spark DataFrame to Ignite. While exploring, I came across ignite-spark, which helps do exactly this. But ignite-spark currently works only with Spark 2.3, not Spark 2.4.

So I fell back to the traditional

df.write.format("jdbc")

Now my code looks like this:

df.write
     .format("jdbc")
     .option("url", "jdbc:ignite:thin://127.0.0.1:10800")
     .option("dbtable", "sample_table")
     .option("user", "ignite")
     .option("password", "ignite")
     .mode(SaveMode.Overwrite)
     .save()
The problem I'm facing is that my DataFrame lacks a primary key, which Ignite requires. Please suggest how to get around this.

Error stack trace below:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
    at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:750)
    at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.execute0(JdbcThinStatement.java:212)
    at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.executeUpdate(JdbcThinStatement.java:340)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:859)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:81)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at com.ev.spark.job.Ignite$.delayedEndpoint$com$ev$spark$job$Ignite$1(Ignite.scala:52)
    at com.ev.spark.job.Ignite$delayedInit$body.apply(Ignite.scala:9)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.ev.spark.job.Ignite$.main(Ignite.scala:9)
    at com.ev.spark.job.Ignite.main(Ignite.scala)
Edit:


I'm looking for a solution that creates the table dynamically before saving the DF. In my case, my DF already has one or more fields that I somehow have to communicate to Spark as the primary key for table creation.

If Ignite needs a column with unique values (as a primary key), you can create one yourself, save the DataFrame, and then drop that column from Ignite.
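A minimal sketch of that idea, assuming Spark's built-in monotonically_increasing_id() is an acceptable source of unique values (the column name pk_id is made up for illustration). Note that over plain JDBC Spark still won't declare the column as a PRIMARY KEY, so this pairs with a pre-created table or with the ignite-spark primary-key option mentioned further below:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.monotonically_increasing_id

// Add a synthetic unique column to act as the key; the generated Longs
// are unique per row but not consecutive.
val dfWithKey = df.withColumn("pk_id", monotonically_increasing_id())

// Save as before; per the suggestion above, the helper column
// could later be removed on the Ignite side.
dfWithKey.write
  .format("jdbc")
  .option("url", "jdbc:ignite:thin://127.0.0.1:10800")
  .option("dbtable", "sample_table")
  .option("user", "ignite")
  .option("password", "ignite")
  .mode(SaveMode.Overwrite)
  .save()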

Refer to the Ignite documentation (you can go straight to the DataFrame API section).


Hope this helps.

Try creating the underlying Ignite table in advance. Define some primary key, e.g. id. Then connect to Ignite through the Spark API and use the created Ignite table, incrementing id manually and passing it through the DataFrames API (see the sketch after this paragraph). An atomic sequence, for example, can be used to generate the unique IDs.
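A rough sketch of that pre-create-then-append flow over the thin JDBC driver, reusing the endpoint and credentials from the question; the two-column schema is illustrative and should be adjusted to match the actual DataFrame:

import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// Create the table up front with an explicit primary key
// (schema shown is illustrative; match it to your DataFrame).
val conn = DriverManager.getConnection(
  "jdbc:ignite:thin://127.0.0.1:10800", "ignite", "ignite")
try {
  conn.createStatement().executeUpdate(
    "CREATE TABLE IF NOT EXISTS sample_table (" +
      "id BIGINT PRIMARY KEY, name VARCHAR" +
      ") WITH \"template=replicated\"")
} finally {
  conn.close()
}

// With the table in place, Append skips Spark's own CREATE TABLE entirely.
df.write
  .format("jdbc")
  .option("url", "jdbc:ignite:thin://127.0.0.1:10800")
  .option("dbtable", "sample_table")
  .option("user", "ignite")
  .option("password", "ignite")
  .mode(SaveMode.Append)
  .save()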


As for the unsupported Spark 2.4 version, I have opened a ticket with the Ignite community. Hopefully it will land in August.

Spark has several save modes that apply when the target table already exists:

* Overwrite - with this option Spark will try to drop and re-create the existing table (or create a new one) and load the data there using the IgniteDataStreamer implementation
* Append - with this option Spark will not try to re-create the existing table or create a new one; it just loads the data into the existing table
* ErrorIfExists - with this option you get an exception if the table you are going to use already exists
* Ignore - with this option nothing is done if the table you are going to use already exists; the save operation will not write the contents of the DataFrame and will not change the existing data
In your example you try to store the data by re-creating the cache, but you don't provide any details for the Ignite table. When using the Overwrite save mode, try adding the following options:

.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=replicated")
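For context, these options belong to the ignite-spark DataFrame API rather than to Spark's generic JDBC source. A full write through that API would look roughly like the sketch below; the XML config path is a placeholder, and in Scala the settings are plain vals (no parentheses, unlike the Java-style accessors above). This path requires the ignite-spark module, which at the time supported only Spark 2.3:

import org.apache.ignite.spark.IgniteDataFrameSettings._
import org.apache.spark.sql.SaveMode

// Write through the ignite-spark data source instead of generic JDBC.
df.write
  .format(FORMAT_IGNITE)
  .option(OPTION_CONFIG_FILE, "/path/to/ignite-config.xml") // placeholder path
  .option(OPTION_TABLE, "sample_table")
  .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
  .option(OPTION_CREATE_TABLE_PARAMETERS, "template=replicated")
  .mode(SaveMode.Overwrite)
  .save()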

Also, consider using Append mode so the table isn't re-created on every run.

BR,
Andrei

Sadly, this doesn't work for me :( It only works if I pre-create the table with a primary key. I'm looking for a solution that creates the table dynamically before persisting the DF; in my case my DF already has one or more fields that I have to communicate to Spark as the primary key for table creation.

You may have already read this (as you said, you've explored ignite-spark), but wanted to share it anyway -

Yes, that works, but it doesn't support the latest Spark version :( which is why I took the JDBC route.

Yes, the only way I see right now is to pre-create the table with the desired primary key. Also, thanks to Denis for marking the ticket as a blocker and helping get ignite-spark onto the latest Spark version. Much appreciated.

There's a good chance the ticket will be included in the August release. Ignite committers are looking into it. Feel free to reach out to us on the Ignite dev list.

Sure. Thanks.

The options you provided are for ignite-spark, not for traditional Spark JDBC persistence. As I mentioned, ignite-spark currently supports Spark 2.3, not Spark 2.4.