Apache Spark: how to convert RDD[Array[Any]] to a DataFrame?


My RDD[Array[Any]] looks like this:

1556273771,Mumbai,1189193,1189198,0.56,-1,India,Australia,1571215104,1571215166
8374749403,London,1189193,1189198,0,1,India,England,4567362933,9374749392
7439430283,Dubai,1189193,1189198,0.76,-1,Pakistan,Sri Lanka,1576615684,4749383749

I need to convert it into a DataFrame with 10 columns, but I am new to Spark. Please let me know the simplest way to do this.

I am trying something like this:

rdd_data.map{case Array(a,b,c,d,e,f,g,h,i,j) => (a,b,c,d,e,f,g,h,i,j)}.toDF()

You can try the approach below; it's a bit tricky, but it works without having to worry about the schema. Map every Any to a String, use toDF() to create a DataFrame with a single array column, and then build new columns by selecting each element from that array column.

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Column

  // Sample RDD[Array[Any]]: 5 rows with 3 elements each
  val rdd: RDD[Array[Any]] = spark.range(5).rdd.map(s => Array(s, s + 1, s % 2))
  val size = rdd.first().length

  // For an array column, build (new column name, element at index i) pairs
  def splitCol(col: Column): Seq[(String, Column)] = {
    for (i <- 0 until size) yield ("_" + i, col(i))
  }

  import spark.implicits._

  rdd.map(s => s.map(_.toString))          // map every Any to String
    .toDF("x")                             // single array<string> column named "x"
    .select(splitCol('x).map(_._2): _*)    // one column per array element
    .toDF(splitCol('x).map(_._1): _*)      // rename them _0, _1, ...
    .show()

+---+---+---+
| _0| _1| _2|
+---+---+---+
|  0|  1|  0|
|  1|  2|  1|
|  2|  3|  0|
|  3|  4|  1|
|  4|  5|  0|
+---+---+---+
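
Applied to the data in the question, the same pattern produces ten string columns, which can then be cast to concrete types. A minimal sketch, assuming rdd_data is the question's RDD[Array[Any]]; the helper mirrors splitCol above, and the _0.._9 names and the casts are illustrative:

  import org.apache.spark.sql.Column
  import spark.implicits._

  // rdd_data is assumed to be the question's RDD[Array[Any]] with 10 elements per row
  val n = rdd_data.first().length
  def cols(arr: Column): Seq[(String, Column)] =
    for (i <- 0 until n) yield ("_" + i, arr(i))

  val raw = rdd_data.map(_.map(_.toString))   // every value becomes a String
    .toDF("x")                                // one array<string> column
    .select(cols('x).map(_._2): _*)
    .toDF(cols('x).map(_._1): _*)

  // All columns are strings at this point; cast the numeric ones as needed
  val typed = raw.select(
    $"_0".cast("long"), $"_1",
    $"_2".cast("int"), $"_3".cast("int"),
    $"_4".cast("double"), $"_5".cast("int"),
    $"_6", $"_7",
    $"_8".cast("long"), $"_9".cast("long")
  )
  typed.printSchema()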

When you create a DataFrame, Spark needs to know the data type of each column. The type Any is just a way of saying that you don't know the variable's type. One possible solution is to cast each value to a specific type. This will, of course, fail if the specified cast is invalid.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd1 = spark.sparkContext.parallelize(
    Array(
        Array(1556273771L,"Mumbai",1189193,1189198 ,0.56,-1,"India",   "Australia",1571215104L,1571215166L),
        Array(8374749403L,"London",1189193,1189198 ,0   , 1,"India",   "England",  4567362933L,9374749392L),
        Array(7439430283L,"Dubai" ,1189193,1189198 ,0.76,-1,"Pakistan","Sri Lanka",1576615684L,4749383749L)
    ),1)
//rdd1: org.apache.spark.rdd.RDD[Array[Any]]

val rdd2 = rdd1.map(r => Row(
    r(0).toString.toLong, 
    r(1).toString, 
    r(2).toString.toInt, 
    r(3).toString.toInt, 
    r(4).toString.toDouble, 
    r(5).toString.toInt, 
    r(6).toString, 
    r(7).toString, 
    r(8).toString.toLong, 
    r(9).toString.toLong
))


val schema = StructType(
List(
    StructField("col0", LongType, false),
    StructField("col1", StringType, false),
    StructField("col2", IntegerType, false),
    StructField("col3", IntegerType, false),
    StructField("col4", DoubleType, false),
    StructField("col5", IntegerType, false),
    StructField("col6", StringType, false),
    StructField("col7", StringType, false),
    StructField("col8", LongType, false),
    StructField("col9", LongType, false)
  ) 
)

val df = spark.createDataFrame(rdd2, schema)

df.show
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|      col0|  col1|   col2|   col3|col4|col5|    col6|     col7|      col8|      col9|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56|  -1|   India|Australia|1571215104|1571215166|
|8374749403|London|1189193|1189198| 0.0|   1|   India|  England|4567362933|9374749392|
|7439430283| Dubai|1189193|1189198|0.76|  -1|Pakistan|Sri Lanka|1576615684|4749383749|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+

df.printSchema
root
 |-- col0: long (nullable = false)
 |-- col1: string (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: double (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: string (nullable = false)
 |-- col7: string (nullable = false)
 |-- col8: long (nullable = false)
 |-- col9: long (nullable = false)

Hope it helps.


As mentioned in the other posts, a DataFrame requires explicit types for each column, so you can't use Any. The easiest way I can think of would be to turn each row into a tuple of the right types and then use implicit DF creation to convert to a DataFrame. You were pretty close in your code; you just need to cast the elements to an acceptable type.

Basically, toDF knows how to convert tuples (with accepted types) into a DF row, and you can pass the column names into the toDF call.

For example:

val data = Array(1556273771, "Mumbai", 1189193, 1189198, 0.56, -1, "India,Australia", 1571215104, 1571215166)
val rdd = sc.parallelize(Seq(data))

val df = rdd.map {
    case Array(a,b,c,d,e,f,g,h,i) => (
        a.asInstanceOf[Int],
        b.asInstanceOf[String],
        c.asInstanceOf[Int],
        d.asInstanceOf[Int],
        e.toString.toDouble,
        f.asInstanceOf[Int],
        g.asInstanceOf[String],
        h.asInstanceOf[Int],
        i.asInstanceOf[Int]
    )
}.toDF("int1", "city", "int2", "int3", "float1", "int4", "country", "int5", "int6")

df.printSchema
df.show(100, false)


scala> df.printSchema
root
 |-- int1: integer (nullable = false)
 |-- city: string (nullable = true)
 |-- int2: integer (nullable = false)
 |-- int3: integer (nullable = false)
 |-- float1: double (nullable = false)
 |-- int4: integer (nullable = false)
 |-- country: string (nullable = true)
 |-- int5: integer (nullable = false)
 |-- int6: integer (nullable = false)


scala> df.show(100, false)
+----------+------+-------+-------+------+----+---------------+----------+----------+
|int1      |city  |int2   |int3   |float1|int4|country        |int5      |int6      |
+----------+------+-------+-------+------+----+---------------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56  |-1  |India,Australia|1571215104|1571215166|
+----------+------+-------+-------+------+----+---------------+----------+----------+
Edit for 0 -> Double:


As André pointed out, if you start with 0 as Any, it will be a java Integer, not a scala Int, and therefore cannot be cast to a scala Double. Convert it to a String first, and then cast it to Double as desired.
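
Putting that together for the ten-element rows in the question, the pattern match from the original attempt just needs these conversions. A minimal sketch, assuming rdd_data is the question's RDD[Array[Any]]; the column names are illustrative:

import spark.implicits._

// rdd_data is assumed to be the question's RDD[Array[Any]].
// Going through String means a value like 0 (boxed as java.lang.Integer)
// still converts cleanly to Double.
val df = rdd_data.map {
  case Array(a, b, c, d, e, f, g, h, i, j) => (
    a.toString.toLong,    // 1556273771
    b.toString,           // Mumbai
    c.toString.toInt,     // 1189193
    d.toString.toInt,     // 1189198
    e.toString.toDouble,  // 0.56 (or 0)
    f.toString.toInt,     // -1
    g.toString,           // India
    h.toString,           // Australia
    i.toString.toLong,    // 1571215104
    j.toString.toLong     // 1571215166
  )
}.toDF("id", "city", "n1", "n2", "score", "flag", "team1", "team2", "ts1", "ts2")

df.printSchema()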


Note that in the column named "float1", the OP has possible values of (0.56, 0, 0.76). This approach will fail there, because you are trying to cast an Integer to a Double. A quick fix is to change "e.asInstanceOf[Double]" to "e.toString.toDouble".

0 (as an Int) can be cast to a Double:
scala> 0.asInstanceOf[Double]
res1: Double = 0.0

But you are not casting 0 "as an Int" to a Double; you are casting 0 "as Any". Try this instead: (0: Any).asInstanceOf[Double]. A more detailed explanation can be found here:

Ah yes, I see it now. Good catch. I'll edit the answer.

This works great and is very easy to understand. Thanks for the help!!
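
For reference, the distinction discussed in these comments can be reproduced in a plain Scala REPL; a minimal sketch (the exact exception text varies by Scala/JVM version):

// A literal Int goes through a primitive numeric conversion:
0.asInstanceOf[Double]        // Double = 0.0

// A 0 held as Any is boxed as java.lang.Integer, so unboxing it as Double fails:
val v: Any = 0
// v.asInstanceOf[Double]     // java.lang.ClassCastException: Integer cannot be cast to Double

// Converting through String handles both 0 and 0.56:
v.toString.toDouble           // Double = 0.0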