Python Spark: getting data out of a complex DataFrame schema using map


I have a structure like the following:

json.select($"comments").printSchema

 root
 |-- comments: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- comment: struct (nullable = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- score: string (nullable = true)
 |    |    |    |-- shouts: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- tags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- username: string (nullable = true)
 |    |    |-- subcomments: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- date: string (nullable = true)
 |    |    |    |    |-- score: string (nullable = true)
 |    |    |    |    |-- shouts: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- tags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |-- username: string (nullable = true)
I want to get an array/list of the comments as [username, score, text]. Normally, in pyspark, I would do something like:

from pyspark.sql import Row

# On Spark 2.x+ a DataFrame has no flatMap; go through .rdd instead.
comments = (json
    .select("comments")
    .flatMap(lambda element:
        map(lambda comment:
            Row(username=comment.username,
                score=comment.score,
                text=comment.text),
            element[0]))
    .toDF())
But when I try the same approach in Scala:

json.select($"comments").rdd.map{row: Row => row(0)}.take(3)
I get some strange output:

Array[Any] =
Array(
  WrappedArray([[string,string,WrappedArray(),WrappedArray(),,string] ...],  ...)
Is there a way to do this task in Scala as easily as in Python?

Also, how do I iterate over a WrappedArray like an ordinary array/list? I get errors like:

error: scala.collection.mutable.WrappedArray.type does not take parameters

How about using a statically typed Dataset?

case class Comment(
    date: String, score: String,
    shouts: Seq[String], tags: Seq[String],
    text: String, username: String
)

df
  .select(explode($"comments.comment").alias("comment"))
  .select("comment.*")
  .as[Comment]
  .map(c => (c.username, c.score, c.text))
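Here explode turns each comments.comment array into one row per comment struct, comment.* flattens that struct into columns, and .as[Comment] binds them to the case class. Both the $ syntax and the encoders come from import spark.implicits._ (sqlContext.implicits._ on Spark 1.x).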
If you don't depend on the REPL, this can be simplified even further:

df
  .select("comments.comment")
  .as[Seq[Comment]]
  .flatMap(_.map(c => (c.username, c.score, c.text)))
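Either variant needs the usual setup outside the shell. A minimal self-contained sketch, under the assumption that the data is read from a hypothetical comments.json into a SparkSession named spark:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

case class Comment(
    date: String, score: String,
    shouts: Seq[String], tags: Seq[String],
    text: String, username: String
)

object CommentsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("comments").getOrCreate()
    import spark.implicits._  // encoders for Comment and tuples, plus $ syntax

    val df = spark.read.json("comments.json")  // hypothetical input path

    df.select(explode($"comments.comment").alias("comment"))
      .select("comment.*")
      .as[Comment]
      .map(c => (c.username, c.score, c.text))
      .show()

    spark.stop()
  }
}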
If you really want to work with Rows, use typed getters (SR here would be a type alias for the column's runtime type, presumably type SR = Seq[Row]):

df.rdd.flatMap(
  _.getAs[SR]("comments")
    .map(_.getAs[Row]("comment"))
    .map {
      // You could also _.getAs[String]("score") or getString(0)
      case Row(_, score: String, _, _, text: String, username: String) => 
        (username, score, text)
    }
)
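As for iterating over a WrappedArray: it implements Seq, so the usual collection methods (map, foreach, indexing via apply) work on the value directly. The "does not take parameters" error typically means the WrappedArray companion object was applied like a function (e.g. WrappedArray(i)), which it doesn't support. A small sketch against the schema above (the function name is illustrative):

import org.apache.spark.sql.Row

// Pull (username, score, text) out of every element of the "comments"
// array of a single top-level Row. getAs[Seq[Row]] succeeds because the
// runtime type, scala.collection.mutable.WrappedArray, is a Seq.
def commentTriples(row: Row): Seq[(String, String, String)] =
  row.getAs[Seq[Row]]("comments").map { element =>
    val c = element.getAs[Row]("comment")
    (c.getAs[String]("username"), c.getAs[String]("score"), c.getAs[String]("text"))
  }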
