Apache Spark: how to convert RDD[Array[Any]] to a DataFrame?

My RDD[Array[Any]] looks like this:
1556273771,Mumbai,1189193,1189198,0.56,-1,India,Australia,1571215104,1571215166
8374749403,London,1189193,1189198,0,1,India,England,4567362933,9374749392
7439430283,Dubai,1189193,1189198,0.76,-1,Pakistan,Sri Lanka,1576615684,4749383749
I need to convert it to a DataFrame with 10 columns, but I'm new to Spark. Please let me know the simplest way to do this. I'm trying something like this:
rdd_data.map{case Array(a,b,c,d,e,f,g,h,i,j) => (a,b,c,d,e,f,g,h,i,j)}.toDF()
You can try the approach below. It's a bit tricky, but you don't have to think about the schema: map each Any to a String, create a DataFrame of arrays with toDF(), and then build one new column per element of the array column.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Column
import spark.implicits._

// Example RDD[Array[Any]] with three elements per row
val rdd: RDD[Array[Any]] = spark.range(5).rdd.map(s => Array(s, s + 1, s % 2))
val size = rdd.first().length

// For an array column, produce one (name, column) pair per element
def splitCol(col: Column): Seq[(String, Column)] =
  for (i <- 0 until size) yield ("_" + i, col(i))

rdd.map(_.map(_.toString))             // turn every Any into a String
  .toDF("x")                           // single array<string> column named x
  .select(splitCol('x).map(_._2): _*)  // one column per array element
  .toDF(splitCol('x).map(_._1): _*)    // rename them _0, _1, _2
  .show()
+---+---+---+
| _0| _1| _2|
+---+---+---+
| 0| 1| 0|
| 1| 2| 1|
| 2| 3| 0|
| 3| 4| 1|
| 4| 5| 0|
+---+---+---+
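Applied to the OP's 10-column data the pattern is identical; a minimal sketch, assuming the question's RDD is bound to rdd_data (the columns come out as strings named _0 .. _9):

val size = rdd_data.first().length          // 10 for the OP's rows
def splitCol(col: Column): Seq[(String, Column)] =
  for (i <- 0 until size) yield ("_" + i, col(i))

rdd_data.map(_.map(_.toString))             // every value becomes a String
  .toDF("x")
  .select(splitCol('x).map(_._2): _*)
  .toDF(splitCol('x).map(_._1): _*)
  .show()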
When you create a DataFrame, Spark needs to know the data type of each column. The Any type is just a way of saying that you don't know the variable's type. One possible solution is to cast each value to a specific type. This will of course fail if the specified cast is invalid.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd1 = spark.sparkContext.parallelize(
Array(
Array(1556273771L,"Mumbai",1189193,1189198 ,0.56,-1,"India", "Australia",1571215104L,1571215166L),
Array(8374749403L,"London",1189193,1189198 ,0 , 1,"India", "England", 4567362933L,9374749392L),
Array(7439430283L,"Dubai" ,1189193,1189198 ,0.76,-1,"Pakistan","Sri Lanka",1576615684L,4749383749L)
),1)
//rdd1: org.apache.spark.rdd.RDD[Array[Any]]
// Cast each value to its concrete type and wrap the result in a Row
val rdd2 = rdd1.map(r => Row(
r(0).toString.toLong,
r(1).toString,
r(2).toString.toInt,
r(3).toString.toInt,
r(4).toString.toDouble,
r(5).toString.toInt,
r(6).toString,
r(7).toString,
r(8).toString.toLong,
r(9).toString.toLong
))
// The matching schema; nullable = false because every cast is expected to succeed
val schema = StructType(
List(
StructField("col0", LongType, false),
StructField("col1", StringType, false),
StructField("col2", IntegerType, false),
StructField("col3", IntegerType, false),
StructField("col4", DoubleType, false),
StructField("col5", IntegerType, false),
StructField("col6", StringType, false),
StructField("col7", StringType, false),
StructField("col8", LongType, false),
StructField("col9", LongType, false)
)
)
val df = spark.createDataFrame(rdd2, schema)
df.show
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
| col0| col1| col2| col3|col4|col5| col6| col7| col8| col9|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56| -1| India|Australia|1571215104|1571215166|
|8374749403|London|1189193|1189198| 0.0| 1| India| England|4567362933|9374749392|
|7439430283| Dubai|1189193|1189198|0.76| -1|Pakistan|Sri Lanka|1576615684|4749383749|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
df.printSchema
root
|-- col0: long (nullable = false)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: integer (nullable = false)
|-- col4: double (nullable = false)
|-- col5: integer (nullable = false)
|-- col6: string (nullable = false)
|-- col7: string (nullable = false)
|-- col8: long (nullable = false)
|-- col9: long (nullable = false)
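If some input values might not parse, a more defensive variant (my own sketch, not part of the original answer) wraps each conversion in scala.util.Try, so a malformed value becomes null instead of failing the job; the corresponding StructFields would then need nullable = true:

import scala.util.Try

// Hypothetical helper: cast via String, fall back to null on failure
def safe[T](v: Any)(cast: String => T): Any =
  Try(cast(v.toString)).getOrElse(null)

val rdd2Safe = rdd1.map(r => Row(
  safe(r(0))(_.toLong),
  safe(r(1))(identity),
  safe(r(2))(_.toInt),
  safe(r(3))(_.toInt),
  safe(r(4))(_.toDouble),
  safe(r(5))(_.toInt),
  safe(r(6))(identity),
  safe(r(7))(identity),
  safe(r(8))(_.toLong),
  safe(r(9))(_.toLong)
))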
Hope it helps.
As other posts mention, a DataFrame requires an explicit type for each column, so you can't use Any. The simplest way I can think of is to turn each row into a tuple of the correct types and then use the implicit DF creation to convert to a DataFrame. Your code was pretty close; you just need to cast the elements to acceptable types. Basically, toDF knows how to convert tuples (of accepted types) into DF rows, and you can pass the column names into the toDF call.

For example:
val data = Array(1556273771, "Mumbai", 1189193, 1189198, 0.56, -1, "India,Australia", 1571215104, 1571215166)
val rdd = sc.parallelize(Seq(data))
val df = rdd.map {
case Array(a,b,c,d,e,f,g,h,i) => (
a.asInstanceOf[Int],
b.asInstanceOf[String],
c.asInstanceOf[Int],
d.asInstanceOf[Int],
e.toString.toDouble, // via String: the raw value may be a boxed Integer (see the edit below)
f.asInstanceOf[Int],
g.asInstanceOf[String],
h.asInstanceOf[Int],
i.asInstanceOf[Int]
)
}.toDF("int1", "city", "int2", "int3", "float1", "int4", "country", "int5", "int6")
df.printSchema
df.show(100, false)
scala> df.printSchema
root
|-- int1: integer (nullable = false)
|-- city: string (nullable = true)
|-- int2: integer (nullable = false)
|-- int3: integer (nullable = false)
|-- float1: double (nullable = false)
|-- int4: integer (nullable = false)
|-- country: string (nullable = true)
|-- int5: integer (nullable = false)
|-- int6: integer (nullable = false)
scala> df.show(100, false)
+----------+------+-------+-------+------+----+---------------+----------+----------+
|int1 |city |int2 |int3 |float1|int4|country |int5 |int6 |
+----------+------+-------+-------+------+----+---------------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56 |-1 |India,Australia|1571215104|1571215166|
+----------+------+-------+-------+------+----+---------------+----------+----------+
Edit for 0 -> Double:
As André pointed out, if you start with 0 as an Any it will be a java Integer, not a scala Int, and therefore cannot be cast to a scala Double. Converting it to a String first and then casting to a Double as needed avoids this.
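A minimal REPL-style illustration of that boxing issue (my own sketch, not from the original answer):

val v: Any = 0              // boxed on the JVM as java.lang.Integer
// v.asInstanceOf[Double]   // would throw ClassCastException: Integer cannot be cast to Double
val d = v.toString.toDouble // "0" -> 0.0 regardless of the boxed type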
Note that in the column named "float1", the OP's possible values are (0.56, 0, 0.76). This approach will fail because you're trying to cast an Integer to a Double. A quick fix is to change "e.asInstanceOf[Double]" to "e.toString.toDouble".

0 (as an Int) can be cast to a Double: scala> 0.asInstanceOf[Double] --> res1: Double = 0.0

But you're not casting 0 "as an Int" to a Double; you're casting 0 "as Any". Try this: (0: Any).asInstanceOf[Double]. A more detailed explanation can be found here:

Ah yes, I see it now. Good catch. I'll edit the answer.

This works great and is very easy to understand. Thanks for the help!!