Modifying a Scala Spark dataframe column with a UDF return value


I have a Spark dataframe with a timestamp field that I want to convert to the long datatype. I used a UDF and the standalone code works fine, but when I plug it into generic logic that needs to convert any timestamp, I can't get it to work. The question is how to assign the return value of the UDF back to the dataframe column.

Below is the code snippet:

    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate()
    import org.apache.spark.sql.functions._
    val sqlContext = spark.sqlContext

    // jsonRDD is deprecated but still works here; it parses the JSON strings into a dataframe
    val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
      """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
      """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
      """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}"""
    )))

    // UDF that converts a timestamp to epoch milliseconds
    val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
      manTs.getTime
    }

    df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show

    +-----+----------+-----+--------------+-----+----+
    |blank|   comment| make|manufacture_ts|model|year|
    +-----+----------+-----+--------------+-----+----+
    |     |No Comment|Tesla| 1508126400000|    S|2012|
    |     |   Get one| Ford| 1508126400000| E350|1997|
    |     |          |Chevy| 1508126400000| Volt|2015|
    +-----+----------+-----+--------------+-----+----+

Now I want to invoke this from a dataframe so that it is called on all columns that need to be converted to long:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    object Test4 extends App {

      val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
      import spark.implicits._

      val long: Long = "1508299200000".toLong

      val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))

      val schema = List(StructField("rowkey", StringType, true),
                        StructField("order_receipt_dt", LongType, true),
                        StructField("maturity_dt", StringType, true))

      val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

      // Fold over the schema, replacing each column with its converted version
      val modifiedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
        newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
      }
      modifiedDf2.show
    }


    object DataTypeUtil {
      import org.apache.spark.sql.{Column, DataFrame}
      import org.apache.spark.sql.functions.udf

      // UDF that converts a timestamp to epoch milliseconds
      val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
        manTs.getTime
      }

      // Return a converting Column for timestamp fields, or the column unchanged otherwise
      def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
        fieldType.toLowerCase match {
          case "timestamp" => convertTimeStamp(dataFrame(name))
          case _           => dataFrame.col(name)
        }
      }
    }

Your UDF will probably crash if the timestamp is null. You can do the following:

  • Use unix_timestamp instead of the UDF, or make your UDF null-safe (see the sketch right after this list)
  • Apply it only to the fields that need conversion
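
As a minimal sketch of the null-safe variant (the name convertTimeStampSafe is mine, not from the original code): returning an Option lets Spark write null into the result column when the input timestamp is null, instead of the UDF throwing a NullPointerException:

import org.apache.spark.sql.functions.udf

// Null-safe variant: Option(null) is None, which Spark renders as a null cell
val convertTimeStampSafe = udf { (manTs: java.sql.Timestamp) =>
  Option(manTs).map(_.getTime)
}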
Given the data:

import java.sql.Timestamp
import java.time.LocalDateTime

import spark.implicits._

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:

val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((df,field) => df.withColumn(field,unix_timestamp(col(field))))

newDF.show()
which outputs:

+---+----------+----------+
| id|       ts1|       ts2|
+---+----------+----------+
|  1|1589109282|1589109282|
+---+----------+----------+
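
Note that unix_timestamp returns whole seconds since the epoch, while Timestamp.getTime in the question returns milliseconds (1589109282 above versus 1508126400000 earlier). If you need the millisecond scale, a minimal sketch (the name newDFMillis is mine; sub-second precision is still lost, since unix_timestamp only resolves whole seconds) is:

// Multiply the seconds by 1000 to match the millisecond scale of Timestamp.getTime
val newDFMillis = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((df, field) => df.withColumn(field, unix_timestamp(col(field)) * 1000))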

Can you explain what you are trying to do, and what problem you are running into? Why not use unix_timestamp()? Can you help and suggest how to handle this?