Apache Spark: create new columns as a one-hot vector

I have a DataFrame:
customer | Department
----------------------
A | Food
B | Home
A | Office
C | Home
A | Home
B | Office
Both the customer and Department columns are of string type.
How can I turn the distinct department values into new columns (like a one-hot vector), so that I get a new DataFrame like this:
customer | Food | Home | Office
-------------------------------
A        | 1    | 1    | 1
B        | 0    | 1    | 1
C        | 0    | 1    | 0
Here the Food, Home, and Office columns are of integer type, and the customer column is of String type.

You can simply group the data by customer, pivot on department, and aggregate with count:
import org.apache.spark.sql.functions.count
import spark.implicits._  // needed for toDF on a local Seq

val df = Seq(
  ("A", "Food"),
  ("B", "Home"),
  ("A", "Office"),
  ("C", "Home"),
  ("A", "Home"),
  ("B", "Office")
).toDF("customer", "department")

df.groupBy("customer")
  .pivot("department")
  .agg(count("department"))
  .na.fill(0)
Output:
+--------+----+----+------+
|customer|Food|Home|Office|
+--------+----+----+------+
|B |0 |1 |1 |
|C |0 |1 |0 |
|A |1 |1 |1 |
+--------+----+----+------+
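One caveat: pivot with count produces occurrence counts, not strict indicators, so the result is 0/1 here only because each (customer, department) pair appears at most once. As a minimal sketch of the same one-hot logic using plain Scala collections (not Spark code; the names `rows` and `oneHot` are illustrative):

```scala
// Plain-collections sketch of the pivot/one-hot logic above.
val rows = Seq(
  ("A", "Food"), ("B", "Home"), ("A", "Office"),
  ("C", "Home"), ("A", "Home"), ("B", "Office")
)

// Column order: the distinct departments, sorted (Food, Home, Office).
val departments = rows.map(_._2).distinct.sorted

// For each customer, emit 1 if they have that department, else 0.
val oneHot: Map[String, Seq[Int]] =
  rows.groupBy(_._1).map { case (customer, rs) =>
    val owned = rs.map(_._2).toSet
    customer -> departments.map(d => if (owned(d)) 1 else 0)
  }
// oneHot("A") == Seq(1, 1, 1); oneHot("C") == Seq(0, 1, 0)
```

If duplicates are possible in your real data and you need a strict 0/1 indicator, clamping the pivoted counts (e.g. with a `when(col > 0, 1)` expression per department column) is one option.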
Does this answer your question? If so, please accept it as the answer; otherwise I doubt you will get help with any further questions.