Scala Spark: count color "mentions" per row
Here is a sample DF:
Car Model Colors
Toyota RAV4 Red, Black
Toyota Camry Red, White
(Any number of colors can be listed.)
How can I change the initial DF into one that counts matches on the first two columns, without duplicating each row (where the number shows how many models of a particular color each car manufacturer has)?
P.S. Here is my attempt at the problem:
import scala.collection.mutable

val folded = rdd
  .groupBy(_.manufacturer)
  .mapValues(_.foldLeft(mutable.HashMap.empty[String, Long]) { (hm, el) =>
    el.colors.foreach(color => hm(color) = hm.getOrElse(color, 0L) + 1)
    hm // foldLeft must return the accumulator
  })
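The fold can be exercised without Spark at all. A minimal, self-contained sketch, assuming a hypothetical `CarRow` case class standing in for the elements of the asker's RDD:

```scala
import scala.collection.mutable

// Hypothetical record type standing in for the RDD's element type.
case class CarRow(manufacturer: String, colors: Seq[String])

val rows = Seq(
  CarRow("Toyota", Seq("Red", "Black")),
  CarRow("Toyota", Seq("Red", "White"))
)

// groupBy + foldLeft; returning the accumulator so the fold type-checks.
val folded: Map[String, mutable.HashMap[String, Long]] =
  rows.groupBy(_.manufacturer).map { case (maker, cars) =>
    maker -> cars.foldLeft(mutable.HashMap.empty[String, Long]) { (hm, el) =>
      el.colors.foreach(c => hm(c) = hm.getOrElse(c, 0L) + 1)
      hm
    }
  }

println(folded("Toyota")("Red")) // 2
```

The key fix over the snippet above is that the closure passed to `foldLeft` must evaluate to the accumulator; `foreach` alone returns `Unit`, so the original version does not compile.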
This gives me the counts, but I don't know how to build the desired DF from `folded`.

You first need to split the Colors column, then explode it, and finally group by car and color. Try the following code:
scala> val initialDf = spark.createDataFrame(List(("Toyota","RAV4","Red,Black"),("Toyota","Camry","Red,White"))).toDF("Car","Model","Colors")
scala> initialDf.select($"Car",explode(split($"Colors",",")).as("Color")).groupBy($"Car",$"Color").agg(count($"Color").as("cnt")).show()
+------+-----+---+
| Car|Color|cnt|
+------+-----+---+
|Toyota|White| 1|
|Toyota| Red| 2|
|Toyota|Black| 1|
+------+-----+---+
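If a wide layout (one column per color) is wanted instead, the grouped counts can be reshaped with `pivot`. A sketch only, assuming the same `initialDf` and an active `SparkSession` named `spark`; the `",\\s*"` regex also absorbs the space after each comma in data like "Red, Black":

```scala
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

initialDf
  .select($"Car", explode(split($"Colors", ",\\s*")).as("Color"))
  .groupBy($"Car")
  .pivot("Color")
  .count()
  .show()
```

`pivot` turns each distinct Color value into its own column, with the per-car counts as cell values.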