Apache Spark JDBC query output returned via mapPartitions
While this example is easy enough to follow:
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)

def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.map(x => index + "," + (x, x, x + 100))
}

rdd1.mapPartitionsWithIndex(myfunc).collect()
I have been trying to fetch some data via a JDBC call inside mapPartitions, the idea being to allow some basic parallel processing. Admittedly the example I give here is not really valid, but for the sake of argument assume some JDBC source, some complex logic that does not suit DataFrames and is easier with RDD processing, and so on. Bear with me.
So I mocked up some calls, but in contrast to the example above, I am not sure how to return the database results via the Iterator[Any] return parameter. This is my issue:
import java.sql.DriverManager
import java.util.Properties

val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)

def myfunc(index: Int, iter: Iterator[String]): Iterator[Any] = {
  val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
  val jdbcPort = 4497
  val jdbcDatabase = "Rfam"
  val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
  val jdbcUsername = "rfamro"
  val jdbcPassword = ""
  val connectionProperties = new Properties()
  connectionProperties.put("user", s"${jdbcUsername}")
  connectionProperties.put("password", s"${jdbcPassword}")
  val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
  iter.map { x =>
    val val1 = x
    val statement = connection.createStatement()
    val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}')""")
    while (resultSet.next()) {
      val hInType = resultSet.getString("type")
    }
  }
}

rdd1.mapPartitionsWithIndex(myfunc).collect()
What I get back is empty data, and I am not sure whether what I want is possible, or how to adapt the approach. For example, I was thinking of retaining the partitioning aspect.
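One way to see why only empty values come back from the attempt above: in Scala, a block whose last expression is a while loop has type Unit, so the iter.map body yields Unit and the strings read inside the loop are discarded. A minimal sketch of that behaviour, with no Spark or JDBC needed:

```scala
// A map body ending in a while loop evaluates to Unit, so the mapped
// iterator carries no data -- only () values.
object UnitBodyDemo extends App {
  val result = Iterator(1, 2, 3).map { x =>
    var acc = 0
    while (acc < x) { acc += 1 } // the while loop is the last expression: type Unit
  }
  println(result.toList) // prints List((), (), ())
}
```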
The line below works fine, of course, and is easy to follow - even for me:
iter.map(x => index + "," + (x, x, x+100))
So, I tried the following, but always got empty output. I suspect what I am attempting simply does not work. I get the impression that execution can skip straight to the last statement. Is that so? I also assumed the connection would be made only once per partition - not sure about that now either.
...
var fruits = new ListBuffer[String]()
iter.map { x =>
  val val1 = x
  println(x)
  val statement = connection.createStatement()
  val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}')""")
  while (resultSet.next()) {
    val hInType = resultSet.getString("type")
    fruits += hInType
  }
}
return fruits.toList.toIterator
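A plausible explanation for the empty buffer here is laziness: Iterator.map does not execute its body until the resulting iterator is consumed, and the snippet above never consumes it, so the side effects that fill fruits never run before fruits is returned. A minimal sketch, again with no Spark or JDBC needed:

```scala
import scala.collection.mutable.ListBuffer

// Iterator.map is lazy: side effects in its body run only when the
// resulting iterator is actually consumed.
object LazyMapDemo extends App {
  val fruits = new ListBuffer[String]()
  val mapped = Iterator("G%", "C%", "I%").map { x => fruits += x }
  println(fruits.size)    // 0 -- the map body has not run yet
  mapped.foreach(_ => ()) // forcing the iterator runs the side effects
  println(fruits.size)    // 3
}
```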
This, however, works - but it is a completely different approach, and I am not sure whether the attempts above can be made to work:
import java.util.Properties
import scala.collection.mutable.ListBuffer
import java.sql.{Connection, DriverManager}

def readMatchingFromDB(record: String, connection: Connection): String = {
  var hInType: String = "XXX"
  val val1 = record
  val statement = connection.createStatement()
  // When using MAX, alias it as "type" so the getString("type") below works.
  val resultSet = statement.executeQuery(s"""(select MAX(type) as type from family where type like '${val1}')""")
  while (resultSet.next()) {
    hInType = resultSet.getString("type")
  }
  hInType // only one value due to MAX
}

val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)

val newRdd = rdd1.mapPartitions(partition => {
  // One connection per partition, created on the executor.
  val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
  val jdbcPort = 4497
  val jdbcDatabase = "Rfam"
  val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
  val jdbcUsername = "rfamro"
  val jdbcPassword = ""
  val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
  // .toList forces the lazy map to run now, while the connection is still open.
  val newPartition = partition.map(record => readMatchingFromDB(record, connection)).toList
  connection.close()
  newPartition.toIterator
}).collect
I tried several options, but got nowhere in terms of simplicity; it may well not be possible. My conclusion was that this does not fit the Spark paradigm and that I would need to use foreachPartition and write to a temporary table, etc. Writing a function, readMatchingFromDB, seems to be the way to go - as people have pointed out on other threads, an actual implementation as opposed to retaining placeholders is an interesting point. Also, this is a contrived example, but I am working from it towards a more realistic use case. In fact there is no longer any reason for the question now, but it may help others.
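For completeness, the foreachPartition route mentioned above might look roughly like this. This is a sketch only: it assumes the jdbcUrl, credentials, and readMatchingFromDB from the working snippet are in scope, and the staging table name and its insert statement are hypothetical.

```scala
// Sketch only: write each partition's JDBC lookups straight to a staging
// table instead of collecting them back to the driver.
// Assumes jdbcUrl, jdbcUsername, jdbcPassword, and readMatchingFromDB
// are defined as in the working snippet above; "my_staging_table" is hypothetical.
rdd1.foreachPartition { partition =>
  val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
  val insert = connection.prepareStatement("insert into my_staging_table (type) values (?)")
  partition.foreach { record =>
    insert.setString(1, readMatchingFromDB(record, connection))
    insert.executeUpdate() // eager, so no laziness pitfall here
  }
  connection.close()
}
```

Because foreachPartition is an action rather than a transformation, its body runs eagerly, which sidesteps the lazy-iterator trap entirely.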