Apache Spark: how to convert RDD[Array[Any]] to a DataFrame?


My RDD[Array[Any]] looks like this:

1556273771,Mumbai,1189193,1189198,0.56,-1,India,Australia,1571215104,1571215166
8374749403,London,1189193,1189198,0,1,India,England,4567362933,9374749392
7439430283,Dubai,1189193,1189198,0.76,-1,Pakistan,Sri Lanka,1576615684,4749383749

I need to convert it into a DataFrame with 10 columns, but I am new to Spark. Please let me know the simplest way to do this.

I am trying something like this:

rdd_data.map{case Array(a,b,c,d,e,f,g,h,i,j) => (a,b,c,d,e,f,g,h,i,j)}.toDF()

You can try the approach below; it's a bit tricky, but it works without having to worry about the schema. Map every Any to a String, use toDF() to create a DataFrame with a single array column, and then build new columns by selecting each element from that array column.

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Column

  // Sample RDD[Array[Any]]: 5 rows with 3 elements each
  val rdd: RDD[Array[Any]] = spark.range(5).rdd.map(s => Array(s, s + 1, s % 2))
  val size = rdd.first().length

  // For an array column, build (new column name, element at index i) pairs
  def splitCol(col: Column): Seq[(String, Column)] = {
    for (i <- 0 until size) yield ("_" + i, col(i))
  }

  import spark.implicits._

  rdd.map(s => s.map(_.toString))          // map every Any to String
    .toDF("x")                             // single array<string> column named "x"
    .select(splitCol('x).map(_._2): _*)    // one column per array element
    .toDF(splitCol('x).map(_._1): _*)      // rename them _0, _1, ...
    .show()

+---+---+---+
| _0| _1| _2|
+---+---+---+
|  0|  1|  0|
|  1|  2|  1|
|  2|  3|  0|
|  3|  4|  1|
|  4|  5|  0|
+---+---+---+
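
Applied to the data in the question, the same pattern produces ten string columns, which can then be cast to concrete types. A minimal sketch, assuming rdd_data is the question's RDD[Array[Any]]; the helper mirrors splitCol above, and the _0.._9 names and the casts are illustrative:

  import org.apache.spark.sql.Column
  import spark.implicits._

  // rdd_data is assumed to be the question's RDD[Array[Any]] with 10 elements per row
  val n = rdd_data.first().length
  def cols(arr: Column): Seq[(String, Column)] =
    for (i <- 0 until n) yield ("_" + i, arr(i))

  val raw = rdd_data.map(_.map(_.toString))   // every value becomes a String
    .toDF("x")                                // one array<string> column
    .select(cols('x).map(_._2): _*)
    .toDF(cols('x).map(_._1): _*)

  // All columns are strings at this point; cast the numeric ones as needed
  val typed = raw.select(
    $"_0".cast("long"), $"_1",
    $"_2".cast("int"), $"_3".cast("int"),
    $"_4".cast("double"), $"_5".cast("int"),
    $"_6", $"_7",
    $"_8".cast("long"), $"_9".cast("long")
  )
  typed.printSchema()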

When you create a DataFrame, Spark needs to know the data type of each column. The type Any is just a way of saying that you don't know the variable's type. One possible solution is to cast each value to a specific type. This will, of course, fail if the specified cast is invalid.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd1 = spark.sparkContext.parallelize(
    Array(
        Array(1556273771L,"Mumbai",1189193,1189198 ,0.56,-1,"India",   "Australia",1571215104L,1571215166L),
        Array(8374749403L,"London",1189193,1189198 ,0   , 1,"India",   "England",  4567362933L,9374749392L),
        Array(7439430283L,"Dubai" ,1189193,1189198 ,0.76,-1,"Pakistan","Sri Lanka",1576615684L,4749383749L)
    ),1)
//rdd1: org.apache.spark.rdd.RDD[Array[Any]]

val rdd2 = rdd1.map(r => Row(
    r(0).toString.toLong, 
    r(1).toString, 
    r(2).toString.toInt, 
    r(3).toString.toInt, 
    r(4).toString.toDouble, 
    r(5).toString.toInt, 
    r(6).toString, 
    r(7).toString, 
    r(8).toString.toLong, 
    r(9).toString.toLong
))


val schema = StructType(
List(
    StructField("col0", LongType, false),
    StructField("col1", StringType, false),
    StructField("col2", IntegerType, false),
    StructField("col3", IntegerType, false),
    StructField("col4", DoubleType, false),
    StructField("col5", IntegerType, false),
    StructField("col6", StringType, false),
    StructField("col7", StringType, false),
    StructField("col8", LongType, false),
    StructField("col9", LongType, false)
  ) 
)

val df = spark.createDataFrame(rdd2, schema)

df.show
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|      col0|  col1|   col2|   col3|col4|col5|    col6|     col7|      col8|      col9|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56|  -1|   India|Australia|1571215104|1571215166|
|8374749403|London|1189193|1189198| 0.0|   1|   India|  England|4567362933|9374749392|
|7439430283| Dubai|1189193|1189198|0.76|  -1|Pakistan|Sri Lanka|1576615684|4749383749|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+

df.printSchema
root
 |-- col0: long (nullable = false)
 |-- col1: string (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: double (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: string (nullable = false)
 |-- col7: string (nullable = false)
 |-- col8: long (nullable = false)
 |-- col9: long (nullable = false)

Hope it helps.


As mentioned in the other posts, a DataFrame requires explicit types for each column, so you can't use Any. The easiest way I can think of would be to turn each row into a tuple of the right types and then use implicit DF creation to convert to a DataFrame. You were pretty close in your code; you just need to cast the elements to an acceptable type.

Basically, toDF knows how to convert tuples (with accepted types) into a DF row, and you can pass the column names into the toDF call.

For example:

val data = Array(1556273771, "Mumbai", 1189193, 1189198, 0.56, -1, "India,Australia", 1571215104, 1571215166)
val rdd = sc.parallelize(Seq(data))

val df = rdd.map {
    case Array(a,b,c,d,e,f,g,h,i) => (
        a.asInstanceOf[Int],
        b.asInstanceOf[String],
        c.asInstanceOf[Int],
        d.asInstanceOf[Int],
        e.toString.toDouble,
        f.asInstanceOf[Int],
        g.asInstanceOf[String],
        h.asInstanceOf[Int],
        i.asInstanceOf[Int]
    )
}.toDF("int1", "city", "int2", "int3", "float1", "int4", "country", "int5", "int6")

df.printSchema
df.show(100, false)


scala> df.printSchema
root
 |-- int1: integer (nullable = false)
 |-- city: string (nullable = true)
 |-- int2: integer (nullable = false)
 |-- int3: integer (nullable = false)
 |-- float1: double (nullable = false)
 |-- int4: integer (nullable = false)
 |-- country: string (nullable = true)
 |-- int5: integer (nullable = false)
 |-- int6: integer (nullable = false)


scala> df.show(100, false)
+----------+------+-------+-------+------+----+---------------+----------+----------+
|int1      |city  |int2   |int3   |float1|int4|country        |int5      |int6      |
+----------+------+-------+-------+------+----+---------------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56  |-1  |India,Australia|1571215104|1571215166|
+----------+------+-------+-------+------+----+---------------+----------+----------+
Edit for 0 -> Double:


As André pointed out, if you start with 0 as Any, it will be a java Integer, not a scala Int, and therefore cannot be cast to a scala Double. Convert it to a String first, and then cast it to Double as desired.
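
Putting that together for the ten-element rows in the question, the pattern match from the original attempt just needs these conversions. A minimal sketch, assuming rdd_data is the question's RDD[Array[Any]]; the column names are illustrative:

import spark.implicits._

// rdd_data is assumed to be the question's RDD[Array[Any]].
// Going through String means a value like 0 (boxed as java.lang.Integer)
// still converts cleanly to Double.
val df = rdd_data.map {
  case Array(a, b, c, d, e, f, g, h, i, j) => (
    a.toString.toLong,    // 1556273771
    b.toString,           // Mumbai
    c.toString.toInt,     // 1189193
    d.toString.toInt,     // 1189198
    e.toString.toDouble,  // 0.56 (or 0)
    f.toString.toInt,     // -1
    g.toString,           // India
    h.toString,           // Australia
    i.toString.toLong,    // 1571215104
    j.toString.toLong     // 1571215166
  )
}.toDF("id", "city", "n1", "n2", "score", "flag", "team1", "team2", "ts1", "ts2")

df.printSchema()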


Note that in the column named "float1", the OP has possible values of (0.56, 0, 0.76). This approach will fail there, because you are trying to cast an Integer to a Double. A quick fix is to change "e.asInstanceOf[Double]" to "e.toString.toDouble".

0 (as an Int) can be cast to a Double:
scala> 0.asInstanceOf[Double]
res1: Double = 0.0

But you are not casting 0 "as an Int" to a Double; you are casting 0 "as Any". Try this instead: (0: Any).asInstanceOf[Double]. A more detailed explanation can be found here:

Ah yes, I see it now. Good catch. I'll edit the answer.

This works great and is very easy to understand. Thanks for the help!!
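
For reference, the distinction discussed in these comments can be reproduced in a plain Scala REPL; a minimal sketch (the exact exception text varies by Scala/JVM version):

// A literal Int goes through a primitive numeric conversion:
0.asInstanceOf[Double]        // Double = 0.0

// A 0 held as Any is boxed as java.lang.Integer, so unboxing it as Double fails:
val v: Any = 0
// v.asInstanceOf[Double]     // java.lang.ClassCastException: Integer cannot be cast to Double

// Converting through String handles both 0 and 0.56:
v.toString.toDouble           // Double = 0.0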