
Apache Spark: returning JDBC query output via mapPartitions


While this example is easy enough to follow:

val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
    iter.map(x => index + "," + (x, x, x+100))
}
rdd1.mapPartitionsWithIndex(myfunc).collect() 
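For reference, the same logic can be run without Spark at all. The slicing below mimics how `parallelize` splits 10 elements over 3 partitions (sizes 3, 3, 4; the split points are my assumption about Spark's slicing, not something stated in the post):

```scala
// Plain-Scala simulation of mapPartitionsWithIndex on the example above.
val data = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
// parallelize(..., 3) would place positions [0,3), [3,6), [6,10) per partition
val partitions = List(data.slice(0, 3), data.slice(3, 6), data.slice(6, 10))

val out = partitions.zipWithIndex.flatMap { case (part, index) =>
  part.map(x => index + "," + (x, x, x + 100))
}
// out.head is "0,(1,1,101)", out.last is "2,(10,10,110)"
```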
I have been trying to fetch some data via JDBC calls inside mapPartitions, the idea being to allow some basic parallel processing. Admittedly the example I give is somewhat contrived, but for the sake of argument assume there is some JDBC source with, say, some complex logic that does not suit DataFrames and is easier to handle with RDDs, etc. Please bear with me.

So I mocked up some calls, but in contrast to the example above I am not sure how to return any result from the database call. Here is my attempt:

import java.sql.DriverManager
import java.util.Properties

val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)

def myfunc(index: Int, iter: Iterator[String]) : Iterator[Any] = {

    val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
    val jdbcPort = 4497
    val jdbcDatabase = "Rfam"
    val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
    val jdbcUsername = "rfamro"
    val jdbcPassword = ""
    val connectionProperties = new Properties()
    connectionProperties.put("user", s"${jdbcUsername}")
    connectionProperties.put("password", s"${jdbcPassword}")
    val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)

    iter.map { x =>
        val val1 = x
        val statement = connection.createStatement()
        val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}' ) """)
        while ( resultSet.next() ) {
            val hInType = resultSet.getString("type") // value is discarded; the while loop itself returns Unit
        }
    }
}

rdd1.mapPartitionsWithIndex(myfunc).collect()
What I get back is empty data, and I am not sure whether what I want is even possible, or how to adapt the approach. I was hoping, for example, to retain the partitioning aspect.

The line below of course works fine and is easy to follow, even for me:

    iter.map(x => index + "," + (x, x, x+100))
So I tried the following, but I always get empty output. I suspect what I tried simply cannot work. I get the impression the compiler thinks it can jump straight to the last statement. Is that so? I also assume the connection is made only once per partition; I am not sure about that either.
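The empty output can be reproduced without Spark or JDBC at all. In the sketch below, `fakeQuery` is a purely hypothetical stand-in for the JDBC lookup; the point is that a `map` body ending in a `while` loop has type `Unit`, so the rows never leave the closure:

```scala
// fakeQuery: hypothetical stand-in for the JDBC lookup against the family table.
def fakeQuery(pattern: String): List[String] =
  List("Gene", "Cis-reg").filter(_.startsWith(pattern.take(1)))

// A while loop is a statement, so a map body ending in one yields Unit:
val wrong: Iterator[Unit] = List("G%", "C%").iterator.map { x =>
  var hInType = ""
  for (row <- fakeQuery(x)) { hInType = row } // result is thrown away
}

// Returning the rows themselves instead; flatMap flattens one List per element:
val right: Iterator[String] = List("G%", "C%").iterator.flatMap(fakeQuery)
// right.toList is List("Gene", "Cis-reg")
```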

...
var fruits = new ListBuffer[String]()

iter.map { x =>
    val val1 = x
    println(x)
    val statement = connection.createStatement()
    val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}' ) """)
    while ( resultSet.next() ) {
        val hInType = resultSet.getString("type")
        fruits += hInType
    }
} // NB: map is lazy and its result is discarded here, so the queries never actually run

return fruits.toList.toIterator

The following does work, but the approach is completely different, and I am not sure whether the approach above can be made to work at all:

import java.util.Properties
import scala.collection.mutable.ListBuffer
import java.sql.{Connection, Driver, DriverManager, JDBCType, PreparedStatement, ResultSet, ResultSetMetaData, SQLException}

def readMatchingFromDB(record: String, connection: Connection) : String = {

    var hInType: String = "XXX"
    val val1 = record 
    val statement = connection.createStatement()
    val resultSet = statement.executeQuery(s"""(select MAX(type) as type from family where type like '${val1}' ) """) // MAX needs the alias so getString("type") below works

    while ( resultSet.next() ) {
            hInType = resultSet.getString("type")                       
        }   
    return hInType // Only returning 1 due to MAX
 }

val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)
val newRdd = rdd1.mapPartitions(

      partition => {
         val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
         val jdbcPort = 4497
         val jdbcDatabase = "Rfam"
         val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
         val jdbcUsername = "rfamro"
         val jdbcPassword = ""
         val connectionProperties = new Properties()
         connectionProperties.put("user", s"${jdbcUsername}")
         connectionProperties.put("password", s"${jdbcPassword}")
         val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)

         val newPartition = partition.map(
           record => {  
                      readMatchingFromDB(record, connection)
                     }).toList

         connection.close()
         newPartition.toIterator  
     }).collect
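One detail worth noting in the working version: `partition.map(...).toList` forces the queries to run while the connection is still open. `Iterator.map` is lazy, so without the `.toList` the lookups would only fire at `collect()` time, after `connection.close()` had already run. A plain-Scala sketch of the same pattern, where the `Resource` class is hypothetical and stands in for a JDBC connection:

```scala
// Hypothetical stand-in for a JDBC connection.
class Resource {
  private var open = true
  def close(): Unit = { open = false }
  def query(x: String): String = {
    require(open, "connection already closed") // a real driver would throw SQLException
    x.toUpperCase // pretend lookup
  }
}

// Shape of the mapPartitions body: open once, materialize, close, return iterator.
def perPartition(partition: Iterator[String]): Iterator[String] = {
  val conn = new Resource
  val materialized = partition.map(conn.query).toList // force while conn is open
  conn.close()
  materialized.iterator
}
```

Dropping the `.toList` here would make `perPartition(...).toList` fail on the closed resource, which is exactly the failure mode the `.toList` in the Spark version avoids.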

I tried several options but never got anything tidier than this; perhaps it simply is not possible. My conclusion is that the original approach does not fit the Spark paradigm, and that otherwise I would need to use foreachPartition and write to a temporary table, etc. Writing a helper function, as done here with readMatchingFromDB, does seem to be the way to go; as people have pointed out on other threads, the actual implementation, as opposed to leaving a placeholder, is the interesting part. Also, this is a contrived example, but I am using it as the basis for a more realistic use case. There is no longer a real outstanding question here, but it may help others.