Scala: How do I add a new column to a DataFrame and populate it?

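Add a new column named Download_Type to a DataFrame, subject to the following conditions:

If Size < 100000, Download_Type = "Small"

If Size > 100000 and Size < 1000000, Download_Type = "Medium"

Otherwise, Download_Type = "Large"

Input data: log_file.txt

Sample data:
"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2012-10-01","00:30:13",35165,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1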

I created a DataFrame using the following steps:

val file1 = sc.textFile("log_file.txt")

val header = file1.first

val logdata = file1.filter(x => x != header)

case class Log(date: String, time: String, size: Double, r_version: String, r_arch: String, r_os: String, packagee: String, version: String, country: String, ipr: Int)

val logfiledata = logdata.map(_.split(",")).map(p => Log(p(0), p(1), p(2).toDouble, p(3), p(4), p(5), p(6), p(7), p(8), p(9).toInt))

val logfiledf = logfiledata.toDF()

I isolated the size column and converted it to an array:

val size = logfiledf.select($"size")

val sizearr = size.collect.map(row=>row.getDouble(0))
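I created a function to populate the newly added column:

// Appends one label per element of `size` to a single string.
def exp1(size: Array[Double]): String = {

  var result = ""

  for (i <- 0 to (size.length - 1)) {

    if (size(i) < 100000) result += "small"

    else if (size(i) >= 100000 && size(i) < 1000000) result += "medium"

    else result += "large"

  }

  result

}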
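
For comparison, here is a minimal sketch of the same thresholds applied per element in plain Scala, without Spark, so that each input value gets its own label instead of all labels being concatenated into one string. The names downloadType and downloadTypes are hypothetical and do not come from the original post, and the capitalised labels follow the wording of the conditions rather than the lowercase strings used in exp1.

```scala
// Hypothetical helper (not from the original post): classify a single size
// value using the thresholds stated in the question.
def downloadType(size: Double): String =
  if (size < 100000) "Small"
  else if (size < 1000000) "Medium"
  else "Large"

// Produce one label per input value, in the same order as the input.
def downloadTypes(sizes: Array[Double]): Array[String] =
  sizes.map(downloadType)

val labels = downloadTypes(Array(35165.0, 150000.0, 2500000.0))
println(labels.mkString(", ")) // prints: Small, Medium, Large
```

Note that a value of exactly 100000 falls into "Medium" here, which matches the `>=` comparison in exp1 rather than the strict "greater than" in the written conditions.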
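How can I populate the new column named Download_Type using the following conditions:

If Size < 100000, Download_Type = "Small"

If Size > 100000 and Size < 1000000, Download_Type = "Medium"

Otherwise, Download_Type = "Large"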

You can simply apply when/otherwise with withColumn to the loaded DataFrame logfiledf, as follows:

import org.apache.spark.sql.functions._
import spark.implicits._

val logfiledf = Seq(
  ("2012-10-01","00:30:13",35165.0,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1),
  ("2012-10-02","00:40:14",150000.0,"2.15.1","i686","linux-gnu","quadprog","1.5-4","US",2)
).toDF("date","time","size","r_version","r_arch","r_os","package","version","country","ip_id")

logfiledf.withColumn("download_type", when($"size" < 100000, "Small").otherwise(
    when($"size" < 1000000, "Medium").otherwise("Large")
  )
).show
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
// |      date|    time|    size|r_version|r_arch|     r_os| package|version|country|ip_id|download_type|
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
// |2012-10-01|00:30:13| 35165.0|   2.15.1|  i686|linux-gnu|quadprog|  1.5-4|     AU|    1|        Small|
// |2012-10-02|00:40:14|150000.0|   2.15.1|  i686|linux-gnu|quadprog|  1.5-4|     US|    2|       Medium|
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
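
The nested when/otherwise above can also be written as a flat chain, since a Column returned by when accepts further .when calls before the final .otherwise. The following is a sketch only: it assumes Spark SQL is on the classpath, it builds a local SparkSession for experimentation (in spark-shell a session named spark already exists), and the three-row sizes DataFrame is made-up illustration data, not taken from log_file.txt.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for trying this out; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("download-type-sketch")
  .getOrCreate()
import spark.implicits._

// Illustrative data only: one value per size category.
val sizes = Seq(35165.0, 150000.0, 2500000.0).toDF("size")

// Conditions are checked in order and the first match wins, so the second
// branch only sees rows where size >= 100000.
val labeled = sizes.withColumn(
  "download_type",
  when(col("size") < 100000, "Small")
    .when(col("size") < 1000000, "Medium")
    .otherwise("Large")
)

labeled.show()
```

The flat chain behaves the same as the nested form and tends to stay readable if more size categories are added later.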

My grammar was terrible when I asked this question. What I meant was: how do I add a new column to a DataFrame and populate it?