Scala: How do I convert an Array[String] to the correct types of a schema?


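I ran into a strange problem while trying to convert the fields of an RDD[Array[String]] to the proper types specified in a schema, so that I can turn it into a Spark SQL DataFrame.

I have an RDD[Array[String]] and a StructType named schema that specifies the types of the fields. This is what I have done so far: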

sqlContext.createDataFrame(
    inputLines.map( rowValues => 
                          RowFactory.create(rowValues.zip(schema.toSeq)
                                                     .map{ case (value, struct) => 
                                                  struct.dataType match {
                                                    case BinaryType => value.toCharArray().map(ch => ch.toByte)
                                                    case ByteType => value.toByte
                                                    case BooleanType => value.toBoolean
                                                    case DoubleType => value.toDouble
                                                    case FloatType => value.toFloat
                                                    case ShortType => value.toShort
                                                    case DateType => value
                                                    case IntegerType => value.toInt
                                                    case LongType => value.toLong
                                                    case _ => value
                                                  }
                                               })), schema)

However, I get the following exception:

java.lang.RuntimeException: Failed to convert value [Ljava.lang.Object;@6e9ffad1 (class of class [Ljava.lang.Object;}) with the type of IntegerType to JSON

This happens when I call the toJSON method.

Do you know why this happens, and what I can do to fix it?

As requested, here is an example:

val schema = StructType(Seq(StructField("id",IntegerType),StructField("val",StringType)))
val inputLines=sc.parallelize(
      Array("1","This is a line for testing"), 
      Array("2","The second line"))

You are passing the Array as the single argument to RowFactory.create. If you look at its method signature:

public static Row create(Object ... values)

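it expects a varargs list, so the whole array ends up as one field of the row. You just need the : _* syntax to expand the array into a varargs list: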

sqlContext.createDataFrame(inputLines.map( rowValues => 
   Row(              // RowFactory.create is java api, use Row.apply instead
      rowValues.zip(schema.toSeq)
                .map{ case (value, struct) => struct.dataType match {
                   case BinaryType => value.toCharArray().map(ch => ch.toByte)
                   case ByteType => value.toByte
                   case BooleanType => value.toBoolean
                   case DoubleType => value.toDouble
                   case FloatType => value.toFloat
                   case ShortType => value.toShort
                   case DateType => value
                   case IntegerType => value.toInt
                   case LongType => value.toLong
                   case _ => value
                   }
                 } : _*            // <-- make varargs here
   )),
   schema)
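
The difference between passing an array as one argument and expanding it with : _* can be seen without Spark. The following is only a minimal sketch: VarargsSketch and its create method are hypothetical stand-ins for a varargs factory such as Row.apply, not part of the Spark API.

```scala
object VarargsSketch {
  // Hypothetical stand-in for a varargs factory such as Row.apply(values: Any*)
  def create(values: Any*): Seq[Any] = values.toList

  def main(args: Array[String]): Unit = {
    val converted: Array[Any] = Array(1, "This is a line for testing")

    // Without expansion, the whole array is treated as one single value
    val wrong = create(converted)
    println(wrong.length) // 1

    // With `: _*`, every element of the array becomes its own argument
    val right = create(converted: _*)
    println(right.length) // 2
  }
}
```

In the first call the "row" holds one field that is itself an array, which matches the [Ljava.lang.Object; value reported for the IntegerType field in the exception.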

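A sample input (schema, inputLines) would be helpful.

I get an exception with

val inputLines = sc.parallelize(Array("1","This is a line for testing"), Array("2","The second line"))

I think it should be:

val inputLines = sc.parallelize(Array(Array("1","This is a line for testing"), Array("2","The second line")))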

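import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Convert one raw string to the JVM type that matches the field's declared data type
def convertTypes(value: String, struct: StructField): Any = struct.dataType match {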
  case BinaryType => value.toCharArray().map(ch => ch.toByte)
  case ByteType => value.toByte
  case BooleanType => value.toBoolean
  case DoubleType => value.toDouble
  case FloatType => value.toFloat
  case ShortType => value.toShort
  case DateType => value
  case IntegerType => value.toInt
  case LongType => value.toLong
  case _ => value
}

val schema = StructType(Seq(StructField("id",IntegerType),
                            StructField("val",StringType)))

val inputLines = sc.parallelize(Array(Array("1","This is a line for testing"), 
                                      Array("2","The second line")))

val rowRdd = inputLines.map{ array => 
  Row.fromSeq(array.zip(schema.toSeq)
                   .map{ case (value, struct) => 
                           convertTypes(value, struct) })
}

val df = sqlContext.createDataFrame(rowRdd, schema)

df.toJSON.collect 
// Array({"id":1,"val":"This is a line for testing"},
//       {"id":2,"val":"The second line"})
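
Note that the conversion above leaves DateType values as plain strings. Spark SQL normally represents DateType values as java.sql.Date, so a schema with a date column would probably run into a similar conversion problem. A minimal sketch of a string-to-date helper follows, assuming the input uses the yyyy-MM-dd format; DateConversionSketch and toSqlDate are hypothetical names, not part of Spark.

```scala
import java.sql.Date
import scala.util.Try

object DateConversionSketch {
  // Hypothetical helper: parse "yyyy-MM-dd" into java.sql.Date.
  // Date.valueOf throws IllegalArgumentException on malformed input,
  // so the call is wrapped in Try and failures become None.
  def toSqlDate(value: String): Option[Date] =
    Try(Date.valueOf(value.trim)).toOption

  def main(args: Array[String]): Unit = {
    println(toSqlDate("2015-06-01")) // Some(2015-06-01)
    println(toSqlDate("not a date")) // None
  }
}
```

Inside convertTypes, the DateType branch could then call java.sql.Date.valueOf(value) directly, or use a helper like this one if malformed values should become nulls instead of failing the job.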