Apache Spark: get values from one DataFrame and pass them into SqlContext in a loop

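I'd like to try the following:

I have a DataFrame with a single column of IDs, called id_list. I want to loop over id_list with foreach, pass each ID into a Spark SQL call, and return the results in another DataFrame.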

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))

Output of the println over id_list:

[123]
[234]
[345]
[456]

Now I try to loop over id_list and run a Spark SQL call for each ID:

id_list.foreach(i => {
    val items = sqlContext.sql("select * from another_items_orc where id = " + i)
    items.foreach(println)
})

First, I'm not sure how to extract the individual value. I get the following error:

org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61

Second: how can I modify the code so that the results end up in a DataFrame I can use later?

Thanks, any help is much appreciated.

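Answer to the first question

When you call foreach, Spark converts the DataFrame into an RDD of type Row. When you then println each element of that RDD, it prints the whole Row, and the first one is [123]. The brackets [] wrap the elements of the row. Elements of a Row are accessed by position. If you want to print just 123, 234, and so on, try: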

id_list.foreach(i => println(i(0)))

You can also use the typed primitive accessors:

id_list.foreach(i => println(i.getString(0))) //For Strings
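
To see why the original query failed, it can help to look at a single Row in isolation. The sketch below builds a Row by hand; the value 123 comes from the output in the question, while treating the column as an integer is an assumption, since the question does not state the type of id. It compares concatenating the Row itself with accessing its first element.

```scala
import org.apache.spark.sql.Row

// A hand-built row standing in for one element of id_list.
// Assumption: the id column holds integers.
val row = Row(123)

// Concatenating the Row itself uses Row.toString, which adds the brackets.
val badQuery = "select * from another_items_orc where id = " + row

// Positional access returns the bare value (typed as Any).
val goodQuery = "select * from another_items_orc where id = " + row(0)

// Typed getters only work when the stored type matches the getter.
val asInt: Int = row.getInt(0)
// row.getString(0) would throw a ClassCastException for an integer value.

println(badQuery)
println(goodQuery)
```

The first string ends in `[123]`, which is exactly the input the parser complains about in the AnalysisException. If id is numeric, `getInt` or `getLong` is likely the right getter rather than `getString`.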

Seriously, read the documentation on Row in Spark that I linked. That turns your code into:

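// collect() brings the IDs to the driver; sqlContext cannot be used inside
// a foreach that runs on the executors.
// getString assumes string columns; use getInt or getLong for numeric ones.
id_list.collect().foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  items.foreach(r => println(r.getString(0)))
})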

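Answer to the second question

I have a sneaking suspicion about what you are really trying to do, but I will answer the question as I understand it.

Let's create an empty DataFrame, then loop over the distinct items from the first DataFrame and union every result into it.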

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))

import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row

// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
  .add("col1", StringType, true)

var df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)

// Loop over the IDs on the driver, select, and union to the empty df.
// collect() is needed: inside id_list.foreach the closure runs on the
// executors, where sqlContext is unavailable and updates to df are lost.
// union is the Spark 2.x name; on Spark 1.x use unionAll instead.
id_list.collect().foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}
df.show()

You now have the DataFrame df, which you can use later.
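
If the list of IDs is small enough to collect, the loop of per-ID queries and unions can probably be replaced by a single filter. The following is a minimal sketch under stated assumptions: it starts its own local SparkSession, the sample rows are made up, the id column is assumed to be an integer, and the names `itemOrc`, `anotherItemsOrc`, `ids` and `matched` are hypothetical stand-ins for the tables in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session and made-up sample data, purely for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("isin-sketch")
  .getOrCreate()
import spark.implicits._

val itemOrc = Seq(123, 234, 234, 345).toDF("id")
val anotherItemsOrc = Seq((123, "a"), (234, "b"), (999, "z")).toDF("id", "name")

// Bring the distinct IDs to the driver; this assumes the list is small.
val ids: Seq[Int] =
  itemOrc.select("id").distinct().collect().map(_.getInt(0)).toSeq

// One filter replaces the loop of per-ID queries and unions.
val matched = anotherItemsOrc.filter(col("id").isin(ids: _*))
matched.show()
```

This avoids both the empty starting DataFrame and the need to keep its schema in sync with another_items_orc, because the result simply inherits the schema of the table being filtered.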

Note: a simpler approach would probably be to join the two DataFrames on the matching column.

// Both tables have an id column; joining on Seq("id") keeps a single copy of it.
val another_items_orc = sqlContext.table("another_items_orc")
val bar = id_list.join(another_items_orc, Seq("id")).select("id")
bar.show()
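
When the goal is to keep every column of another_items_orc for the matching IDs, a left-semi join is one option worth considering, since it returns only the left side's columns and does not need the IDs to be distinct or collected to the driver. This is a sketch under assumptions, not the answerer's code: the local SparkSession, the sample rows, and the names `itemOrc`, `anotherItemsOrc` and `matched` are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and made-up sample data, purely for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("semi-join-sketch")
  .getOrCreate()
import spark.implicits._

val itemOrc = Seq(123, 234, 234, 345).toDF("id")
val anotherItemsOrc = Seq((123, "a"), (234, "b"), (999, "z")).toDF("id", "name")

// Left-semi join: keep rows of anotherItemsOrc whose id appears in itemOrc.
// Duplicate IDs on the right side do not duplicate rows in the result.
val matched = anotherItemsOrc.join(
  itemOrc,
  anotherItemsOrc("id") === itemOrc("id"),
  "leftsemi"
)
matched.show()
```

Unlike the collect-based approaches, everything here stays distributed, so it should also hold up when the ID list is large.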

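What is the data type of the id column in item_orc? Could you also add the missing closing parentheses to the code?

Thanks, this was very helpful.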