Scala Spark: extracting values from a Row

I have the following DataFrame:

val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id,
  COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")

I am trying to convert the rows to Rating objects, but this fails because x(0) returns an array:

val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))

error: value toInt is not a member of Any
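
The error message points at the static type rather than the data: Row.apply(i) is declared to return Any, and Any has no toInt method. A minimal sketch, using a hand-built Row (a hypothetical stand-in for one row of the query result) to show the difference between the untyped and the typed accessors:

```scala
import org.apache.spark.sql.Row

// Hypothetical row with the same shape as the query result: (Int, Int, Long).
val row = Row(1, 2, 1L)

// Row.apply(i) has static type Any, so row(0).toInt would not compile.
val untyped: Any = row(0)

// A typed getter or an explicit cast recovers the concrete type.
val userId: Int = row.getInt(0)
val categoryId: Int = row(1).asInstanceOf[Int]
val count: Long = row.getLong(2)
```
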

Let's start with some dummy data:

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema

// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)

There are several ways to access Row values and keep the expected types:

Pattern matching:

import org.apache.spark.sql.Row

transactions_with_counts.map{
  case Row(user_id: Int, category_id: Int, rating: Long) =>
    Rating(user_id, category_id, rating)
}

Typed get* methods such as getInt and getLong:

transactions_with_counts.map(
  r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
)

The getAs method, which accepts both names and indices:

transactions_with_counts.map(r => Rating(
  r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
))

It can be used to correctly extract user-defined types, including mllib.linalg.Vector. Obviously, accessing by name requires a schema.

Converting to a statically typed Dataset (Spark 1.6+ / 2.0+):
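
A minimal sketch of this option, assuming Spark 2.x and a local SparkSession named spark (the session setup is an assumption; in spark-shell it already exists). The three columns are mapped onto a tuple by position, so the tuple type has to mirror the schema printed earlier (integer, integer, long):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session created only for this example.
val spark = SparkSession.builder().master("local[*]").appName("row-values").getOrCreate()
import spark.implicits._

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")
val transactions_with_counts = transactions.groupBy($"user_id", $"category_id").count

// Tuple encoders map columns by ordinal: (user_id, category_id, count).
val typed: Dataset[(Int, Int, Long)] = transactions_with_counts.as[(Int, Int, Long)]

// Fields are now accessed with their static types, no casts needed.
val counts: Array[Long] = typed.map(_._3).collect()
```
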


With Datasets, you can define Rating as follows:
case class Rating(user_id: Int, category_id:Int, count:Long)

val transactions_with_counts = transactions.groupBy($"user_id", $"category_id").count

val rating = transactions_with_counts.as[Rating]

Here the Rating class has a field named count, rather than rating as zero323 suggested, and the rating value is obtained with transactions_with_counts.as[Rating] as shown above.

This way you will not run into runtime errors in Spark, because the field name in your Rating class is identical to the count column name that Spark generates at runtime.

To access the values in a DataFrame's rows, use rdd.collect on the DataFrame together with a for loop.

Suppose your DataFrame looks like this:
val df = Seq(
  (1, "James"),
  (2, "Albert"),
  (3, "Pete")).toDF("user_id", "name")

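Use rdd.collect on the DataFrame. The row variable will hold each row of the DataFrame as an RDD Row. To get the elements of a row, use row.mkString(","), which yields the row's values as a single comma-separated string. With the built-in split function you can then access each column value of the row by index: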
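for (row <- df.rdd.collect) {
  // Join the row into one string and split it once; every field comes back as a String
  val fields = row.mkString(",").split(",")
  val user_id = fields(0)
  val name = fields(1)  // second column of df is "name"
}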
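
Note that mkString followed by split turns every value into a String and misreads any value that itself contains a comma. A hedged alternative sketch, keeping the same small DataFrame but reading the fields with typed getters instead; the local SparkSession named spark is an assumption added for the example:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created only for this example.
val spark = SparkSession.builder().master("local[*]").appName("row-getters").getOrCreate()
import spark.implicits._

val df = Seq((1, "James"), (2, "Albert"), (3, "Pete")).toDF("user_id", "name")

// collect() pulls every row to the driver, so this only suits small results.
val pairs: Array[(Int, String)] = df.collect().map { row =>
  val user_id = row.getInt(0)            // stays an Int
  val name = row.getAs[String]("name")   // lookup by name uses the row's schema
  (user_id, name)
}
```

For larger data, the same getters can be used inside a map on the DataFrame itself, as in the earlier answer, so that nothing has to be collected first.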

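Which of the four approaches mentioned above is the most efficient?

@Dilan Pattern matching and the statically typed option may be slightly slower (the latter has some other performance implications). getAs[_] and get* should be similar, but are painful to use.

1. What does "the latter has some other performance implications" mean? 2. Are getAs[_] and get* better than pattern matching in terms of performance?

I'm using approach 1 described above for a DataFrame with nullable columns, like case Row(usrId: Int, usrName: String, null, usrMobile: Int) => ... and case Row(usrId: Int, usrName: String, usrAge: Int, null) => ... This results in long case expressions (I have several cases). Is there a cleaner way (more concise, less boilerplate and repetition) to do this? Please answer with an example.

Great answer, Mr 0323.

Why perform collect to apply a transformation? This will crash the driver if the data doesn't fit in memory.

Downvote, because as far as I can tell splitting the row for each value is inefficient. Using the get* methods would avoid splitting for each value.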
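
For the nullable-column question in the comments, one possible sketch is to wrap each nullable column in an Option via Row.isNullAt instead of writing one case per null combination. The User case class and the optAt helper are hypothetical names introduced for illustration; the column order follows the comment:

```scala
import org.apache.spark.sql.Row

// Hypothetical record type; nullable columns become Option fields.
case class User(usrId: Int, usrName: String, usrAge: Option[Int], usrMobile: Option[Int])

// Hypothetical helper: None when the column is null, Some(value) otherwise.
def optAt[T](row: Row, i: Int): Option[T] =
  if (row.isNullAt(i)) None else Some(row.getAs[T](i))

// A single conversion covers every combination of null columns.
def toUser(row: Row): User =
  User(row.getInt(0), row.getString(1), optAt[Int](row, 2), optAt[Int](row, 3))

val a = toUser(Row(1, "Ann", null, 555))
val b = toUser(Row(2, "Bob", 30, null))
```

On a real DataFrame this would be used as df.map(toUser) or df.rdd.map(toUser), assuming the columns appear in that order.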