MySQL Spark Structured Streaming: primary key in a JDBC sink
I am reading a stream of data from a Kafka topic using Structured Streaming with update mode, and then applying some transformations. I then create a JDBC sink to push the data into a MySQL sink in Append mode. The problem is: how do I tell my sink which column is my primary key, and have it update based on that key, so that my table does not end up with duplicate rows?
import org.apache.spark.sql.{DataFrame, Dataset, SaveMode}
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

val df: DataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<List-here>")
  .option("subscribe", "emp-topic")
  .load()

import spark.implicits._

// the value in Kafka is bytes, so cast it to String
val empList: Dataset[Employee] = df
  .selectExpr("CAST(value AS STRING)")
  .map(row => Employee(row.getString(0)))

// window aggregations on 1-minute windows
val aggregatedDf = ......

// How do I say here that id is my primary key, and that the update
// should happen based on the id column?
aggregatedDf
  .writeStream
  .trigger(Trigger.ProcessingTime(60.seconds))
  .outputMode(OutputMode.Update)
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .select("id", "name", "salary", "dept")
      .write.format("jdbc")
      .option("url", "jdbc:mysql://localhost/empDb")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "empDf")
      .option("user", "root")
      .option("password", "root")
      .mode(SaveMode.Append)
      .save()
  }
  .start()
One approach is to use foreachPartition and let MySQL's ON DUPLICATE KEY UPDATE do the upsert; that may help achieve this. Note that this relies on the primary key (or a unique index) being declared on the MySQL table itself, since MySQL is what detects the conflicting row on insert.
Below is a pseudo-code snippet:
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.{DataFrame, Row}

/**
 * Insert into the database using foreachPartition,
 * relying on MySQL's ON DUPLICATE KEY UPDATE to upsert.
 * @param dataframe : DataFrame to write
 * @param sqlDatabaseConnectionString : JDBC connection string
 * @param sqlTableName : target table name
 */
def insertToTable(dataframe: DataFrame, sqlDatabaseConnectionString: String, sqlTableName: String): Unit = {
  // numPartitions = number of simultaneous DB connections you plan to allow
  val numPartitions = 10
  val repartitioned = dataframe.repartition(numPartitions)
  val columns = repartitioned.columns
  val tableHeader: String = columns.mkString(",")
  // on a duplicate key, overwrite every non-key column with the incoming value
  val updateClause: String = columns
    .filterNot(_ == "yourprimarykeycolumn") // replace with your primary key column
    .map(c => s"$c=VALUES($c)")
    .mkString(", ")
  repartitioned.foreachPartition { partition: Iterator[Row] =>
    // Note: one connection per partition (a better approach is to use a connection pool)
    val sqlExecutorConnection: Connection = DriverManager.getConnection(sqlDatabaseConnectionString)
    // a batch size of 1000 is used since some databases cannot take batches larger than 1000, e.g. Azure SQL
    partition.grouped(1000).foreach { group =>
      val insertString: String = group
        .map(record => "('" + record.mkString("','") + "')")
        .mkString(",")
      val sql = s"""
        |INSERT INTO $sqlTableName ($tableHeader) VALUES
        |$insertString
        |ON DUPLICATE KEY UPDATE
        |$updateClause
        |""".stripMargin
      sqlExecutorConnection.createStatement().executeUpdate(sql)
    }
    sqlExecutorConnection.close() // close the per-partition connection
  }
}
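For instance, this helper could replace the plain JDBC append in the foreachBatch sink from the question; a minimal sketch, where the connection string (with credentials in the URL) and the empDf table name are assumptions taken from the question's options:
// Sketch: call the upsert helper from foreachBatch instead of SaveMode.Append
aggregatedDf
  .writeStream
  .trigger(Trigger.ProcessingTime(60.seconds))
  .outputMode(OutputMode.Update)
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    insertToTable(
      batchDF.select("id", "name", "salary", "dept"),
      "jdbc:mysql://localhost/empDb?user=root&password=root", // assumed connection string
      "empDf")
  }
  .start()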
Instead of a plain JDBC Statement, you can use a PreparedStatement.
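A minimal sketch of that variant, assuming the emp schema from the question (id as the primary key, plus name, salary and dept; the column types here are assumptions):
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{DataFrame, Row}

// Sketch: upsert each row through a parameterized statement, which also avoids
// the quoting and SQL-injection problems of building the VALUES list by hand.
def upsertWithPreparedStatement(batchDF: DataFrame, connectionString: String): Unit = {
  batchDF.foreachPartition { partition: Iterator[Row] =>
    val conn: Connection = DriverManager.getConnection(connectionString)
    val stmt: PreparedStatement = conn.prepareStatement(
      """INSERT INTO empDf (id, name, salary, dept) VALUES (?, ?, ?, ?)
        |ON DUPLICATE KEY UPDATE name=VALUES(name), salary=VALUES(salary), dept=VALUES(dept)
        |""".stripMargin)
    try {
      partition.foreach { row =>
        stmt.setLong(1, row.getAs[Long]("id"))         // assumed BIGINT primary key
        stmt.setString(2, row.getAs[String]("name"))
        stmt.setDouble(3, row.getAs[Double]("salary")) // assumed DOUBLE column
        stmt.setString(4, row.getAs[String]("dept"))
        stmt.addBatch()
      }
      stmt.executeBatch() // send the partition's rows as one JDBC batch
    } finally {
      stmt.close()
      conn.close()
    }
  }
}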
Further reading:
Thanks for your answer. But the question you referenced seems to be about three years old, so I was wondering whether the current version, Spark 2.4.0, offers anything new for this.
The approach above also works with current Spark versions, since it operates at the RDD level.
For now I am marking the answer as accepted, since I could not find any better alternative in Spark.