
Scala: How to add a column to a DataFrame based on the values of other columns

Tags: scala, apache-spark, dataframe, apache-spark-sql

I have a DataFrame with a column "Age" of type String, and I want to derive a new column that contains the matching age range, also formatted as a String.

The range boundaries are as follows:

[-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000]

For example, given the input values

 Age
=====
 -1
 12
 18
 28
 38
 46
=====
the desired output is:

  Age    Age-Range
 =====  =========
 -1     (-1,12)
 12     (-1,12)
 18     (12-17)
 28     (24-34)
 38     (34-44)
 46     (44-54)
 =====  =========

Any suggestions or help would be greatly appreciated.

Here is a quick suggestion, I hope it helps:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

case class AgeRange(lowerBound: Int, upperBound: Int) {
  def contains(value: Int): Boolean = value >= lowerBound && value < upperBound
}

// Build the ranges from consecutive pairs of boundaries
val ranges = List(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
  .sliding(2).map(list => AgeRange(list(0), list(1))).toList

val dataset = Seq("-1", "12", "18", "28", "38", "46").toDS

def findRange(value: Int, ageRanges: List[AgeRange]): Option[AgeRange] =
  ageRanges.find(_.contains(value))

// With a UDF
def myUdf(ageRanges: List[AgeRange]) = udf {
  i: Int => findRange(i, ageRanges)
}
val result1 = dataset.toDF("age").withColumn("age_range", myUdf(ranges)(col("age").cast("int")))

// With map
val result2 = dataset.map {
  i: String => (i, findRange(i.toInt, ranges))
}.toDF("age", "age_range")
Resulting in:

result1: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>]
result2: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>]
+---+---------+
|age|age_range|
+---+---------+
| -1|  [-1,12]|
| 12|  [12,17]|
| 18|  [17,24]|
| 28|  [24,34]|
| 38|  [34,44]|
| 46|  [44,54]|
+---+---------+
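
Both results store the matched range as a struct. Since the question asks for the range as a String, a small formatting variant (my sketch, reusing findRange, ranges, and dataset from above; not part of the original answer) could render it directly:

// Hypothetical variant: format the matched range as "(lower-upper)",
// with an empty string when no range contains the value.
def myStringUdf(ageRanges: List[AgeRange]) = udf { i: Int =>
  findRange(i, ageRanges).map(r => s"(${r.lowerBound}-${r.upperBound})").getOrElse("")
}
val result3 = dataset.toDF("age").withColumn("age_range", myStringUdf(ranges)(col("age").cast("int")))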


You can use a udf function as follows:

import org.apache.spark.sql.functions.udf

def range = udf((age: String) => {
  val array = Array(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
  val ageInt = age.toInt
  // lower bound: largest boundary <= age; upper bound: smallest boundary > age
  array.filter(i => i <= ageInt).last.toString + "-" + array.filter(i => i > ageInt).head.toString
})
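
Applied to the question's DataFrame (the invocation below is my guess at the elided usage, assuming df has the "Age" column):

// Hypothetical usage, assuming df is the question's DataFrame
df.withColumn("Age-Range", range($"Age")).show(false)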
You should get the following output:

+---+---------+
|Age|Age-Range|
+---+---------+
|-1 |-1-12    |
|12 |12-17    |
|18 |17-24    |
|28 |24-34    |
|38 |34-44    |
|46 |44-54    |
+---+---------+

The final output is not exactly what you need, but it should give you enough ideas to reach the correct solution.
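
Note that .last and .head in the udf above throw an exception when the age falls outside the boundary list (for example, above 1000). A defensive variant (a sketch of mine, not from the original answer) could use lastOption/headOption:

// Hypothetical defensive variant: returns "out-of-range" instead of
// throwing when the age is not covered by the boundary list.
def safeRange = udf((age: String) => {
  val bounds = Array(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
  val ageInt = age.toInt
  (bounds.filter(_ <= ageInt).lastOption, bounds.filter(_ > ageInt).headOption) match {
    case (Some(lo), Some(hi)) => s"$lo-$hi"
    case _                    => "out-of-range"
  }
})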


Here is a simple solution using a UDF, but you will need to create the list of ranges manually.

import org.apache.spark.sql.functions.udf
import spark.implicits._

// DataFrame with column Age
val df = spark.sparkContext.parallelize(Seq("-1", "12", "18", "28", "38", "38", "388", "3", "41")).toDF("Age")

val updateUDF = udf((age : String) => {
  val range = Seq(
    (-1, 12, "(-1 - 12)"),
    (12, 17, "(12 - 17)"),
    (17, 24, "(17 - 24)"),
    (24, 34, "(24 - 34)"),
    (34, 44, "(34 - 44)"),
    (44, 54, "(44 - 54)"),
    (54, 64, "(54 - 64)"),
    (64, 100, "(64 - 100)"),
    (100, 1000, "(100- 1000)")
  )
  range.map(value => {
    if (age.toInt >= value._1 && age.toInt < value._2) value._3
    else ""
  }).filter(!_.equals(""))(0)

})

  df.withColumn("Age-Range", updateUDF($"Age")).show(false)

Here is the output:
+---+-----------+
|Age|Age-Range  |
+---+-----------+
|-1 |(-1 - 12)  |
|12 |(12 - 17)  |
|18 |(17 - 24)  |
|28 |(24 - 34)  |
|38 |(34 - 44)  |
|38 |(34 - 44)  |
|388|(100- 1000)|
|3  |(-1 - 12)  |
|41 |(34 - 44)  |
+---+-----------+
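
One caveat: the map(...).filter(...)(0) chain inside the UDF throws an IndexOutOfBoundsException for any age that no range covers. A safer lookup (my sketch, not part of the original answer) could use collectFirst:

// Hypothetical safer lookup: collectFirst yields an Option, so unmatched
// ages produce a fallback label instead of an exception.
val safeUpdateUDF = udf((age: String) => {
  val range = Seq(
    (-1, 12, "(-1 - 12)"), (12, 17, "(12 - 17)"), (17, 24, "(17 - 24)"),
    (24, 34, "(24 - 34)"), (34, 44, "(34 - 44)"), (44, 54, "(44 - 54)"),
    (54, 64, "(54 - 64)"), (64, 100, "(64 - 100)"), (100, 1000, "(100- 1000)")
  )
  range.collectFirst {
    case (low, high, label) if age.toInt >= low && age.toInt < high => label
  }.getOrElse("unknown")
})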

I hope this helps.
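
As a closing side note (not from any of the answers above): Spark ML also ships a built-in Bucketizer that performs exactly this kind of binning without a hand-written UDF, producing a numeric bucket index that can then be mapped to a label. A minimal sketch, assuming a SparkSession named spark is in scope and df is the DataFrame above; the column names AgeNum and AgeBucket are placeholders:

import org.apache.spark.ml.feature.Bucketizer

// Same boundaries as in the question; Bucketizer requires Doubles, and
// values outside [-1, 1000] will error unless handleInvalid is configured.
val splits = Array(-1.0, 12.0, 17.0, 24.0, 34.0, 44.0, 54.0, 64.0, 100.0, 1000.0)

val bucketizer = new Bucketizer()
  .setInputCol("AgeNum")     // placeholder name for the numeric input column
  .setOutputCol("AgeBucket") // bucket index 0.0, 1.0, ... per row
  .setSplits(splits)

val withBuckets = bucketizer.transform(df.withColumn("AgeNum", $"Age".cast("double")))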
