Scala: how to correctly save Spark RDD results to a MySQL database
Currently, I save Spark RDD results to a MySQL database with the following steps:
import anorm._
import java.sql.Connection
import org.apache.spark.rdd.RDD

val wordCounts: RDD[(String, Int)] = ...

// dbUrl is the JDBC connection string, e.g. "jdbc:mysql://host:3306/db"
def getDbConnection(dbUrl: String): Connection = {
  Class.forName("com.mysql.jdbc.Driver").newInstance()
  java.sql.DriverManager.getConnection(dbUrl)
}

// Loan pattern: guarantees the resource is closed even if f throws
def using[X <: {def close()}, A](resource: X)(f: X => A): A =
  try { f(resource) } finally { resource.close() }

// Open one connection per partition, not one per record
wordCounts.foreachPartition { iter =>
  using(getDbConnection(dbUrl)) { implicit conn =>
    iter.foreach { case (word, count) =>
      // Anorm interpolation: $word and $count become bound parameters
      SQL"insert into WordCount VALUES($word, $count)".executeUpdate()
    }
  }
}
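Row-by-row executeUpdate inside foreachPartition can still be slow; a common refinement is to send each partition's rows to the database in fixed-size batches (one addBatch/executeBatch round trip per batch). The batching logic itself is plain Scala and can be sketched without a database; the batch size of 500 below is an illustrative assumption, not a recommendation from the original answer:

```scala
object BatchDemo {
  // Split one partition's iterator into fixed-size batches.
  // In the JDBC version, each inner list would become one
  // addBatch(...) loop followed by a single executeBatch() call.
  def batches[A](iter: Iterator[A], size: Int): List[List[A]] =
    iter.grouped(size).map(_.toList).toList
}
```

For example, 1201 rows with a batch size of 500 yield batches of 500, 500, and 201 rows, so the partition costs three round trips instead of 1201.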
Check the number of partitions of the DataFrame you are writing: too many or too few partitions will hurt the performance of the second approach. OK, how do I check the number of partitions? dataSet.rdd.getNumPartitions
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)
val wordCountSchema = StructType(List(
  StructField("word", StringType, nullable = false),
  StructField("count", IntegerType, nullable = false)))
val wordCountRowRDD = wordCounts.map(p => Row(p._1, p._2))
val wordCountDF = sqlContext.createDataFrame(wordCountRowRDD, wordCountSchema)
// Only needed if you also want to query the data with SQL;
// it is not required for the JDBC write below
wordCountDF.registerTempTable("WordCount")
// The Properties object should carry the user/password entries
// if they are not already embedded in dbUrl
wordCountDF.write.mode("overwrite").jdbc(dbUrl, "WordCount", new java.util.Properties())
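The partition advice can be made concrete: during write.jdbc each output partition opens its own JDBC connection, so it is common to clamp the partition count to a connection budget before writing. The helper and the cap of 8 below are illustrative assumptions; in Spark you would apply it as wordCountDF.coalesce(targetPartitions(wordCountDF.rdd.getNumPartitions, 8)) before the write:

```scala
object PartitionBudget {
  // Each DataFrame partition opens one JDBC connection during the write,
  // so clamp the partition count to the database's connection budget.
  // Never go below 1, and never increase the count (coalesce only shrinks).
  def targetPartitions(current: Int, maxConnections: Int): Int =
    math.max(1, math.min(current, maxConnections))
}
```

With a budget of 8 connections, a 200-partition DataFrame is coalesced to 8 writers, while a 4-partition DataFrame is left as-is.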