How does Apache Spark compute and store the frequency of items in a PySpark DataFrame column?
I have a dataset:
simpleData = [("person1", "city1"),
              ("person1", "city2"),
              ("person1", "city1"),
              ("person1", "city3"),
              ("person1", "city1"),
              ("person2", "city3"),
              ("person2", "city2"),
              ("person2", "city3"),
              ("person2", "city3")]
columns = ["persons_name", "city_visited"]
exp = spark.createDataFrame(data=simpleData, schema=columns)
exp.printSchema()
exp.show()
which looks like this:
root
|-- persons_name: string (nullable = true)
|-- city_visited: string (nullable = true)
+------------+------------+
|persons_name|city_visited|
+------------+------------+
| person1| city1|
| person1| city2|
| person1| city1|
| person1| city3|
| person1| city1|
| person2| city3|
| person2| city2|
| person2| city3|
| person2| city3|
+------------+------------+
Now I want to create n new columns, where n is the number of unique items in the "city_visited" column, so that the frequency of every unique item is stored for each person.
The output should look like this:
+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
| person1| 3| 1| 1|
| person2| 0| 1| 3|
+------------+-----+-----+-----+
How can I achieve this?
Use pivot after groupBy:
exp.groupBy('persons_name').pivot('city_visited').count()
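For the sample data above, this should produce something like the following; note that a combination that never occurs (person2 never visits city1) comes back as null:
+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2| null|    1|    3|
+------------+-----+-----+-----+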
If you want 0 instead of null:
exp.groupBy('persons_name').pivot('city_visited').count().fillna(0)
To sort by person name, append .orderBy('persons_name') to the query.
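Putting the pieces together, a minimal end-to-end version of the query might look like this (reusing the exp DataFrame built in the question):
# group by person, spread the distinct cities into columns,
# count occurrences, replace missing counts with 0, and sort by name
result = (exp.groupBy('persons_name')
             .pivot('city_visited')
             .count()
             .fillna(0)
             .orderBy('persons_name'))
result.show()
If the set of cities is known up front, passing it explicitly, e.g. pivot('city_visited', ['city1', 'city2', 'city3']), avoids an extra pass over the data to collect the distinct values.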