How to compute and store the frequency of items in a PySpark DataFrame column?


I have a dataset:

from pyspark.sql import SparkSession

# spark is available automatically in the pyspark shell; create a session
# explicitly when running this as a standalone script.
spark = SparkSession.builder.getOrCreate()

simpleData = [
    ("person1", "city1"),
    ("person1", "city2"),
    ("person1", "city1"),
    ("person1", "city3"),
    ("person1", "city1"),
    ("person2", "city3"),
    ("person2", "city2"),
    ("person2", "city3"),
    ("person2", "city3"),
]
columns = ["persons_name", "city_visited"]
exp = spark.createDataFrame(data=simpleData, schema=columns)

exp.printSchema()
exp.show()
The output looks like this:

root
 |-- persons_name: string (nullable = true)
 |-- city_visited: string (nullable = true)

+------------+------------+
|persons_name|city_visited|
+------------+------------+
|     person1|       city1|
|     person1|       city2|
|     person1|       city1|
|     person1|       city3|
|     person1|       city1|
|     person2|       city3|
|     person2|       city2|
|     person2|       city3|
|     person2|       city3|
+------------+------------+
Now I want to create n new columns, where n is the number of unique items in the city_visited column, so that the frequency of every unique city is stored for each person. The output should look like this:

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2|    0|    1|    3|
+------------+-----+-----+-----+

How can I achieve this?

Use pivot after groupBy:

exp.groupBy('persons_name').pivot('city_visited').count()

If you want 0 instead of null, append fillna(0):

exp.groupBy('persons_name').pivot('city_visited').count().fillna(0)

If you want the result ordered by persons_name, append .orderBy('persons_name') to the query.
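
Putting the pieces together, here is a minimal sketch of the full chain, reusing the exp DataFrame built in the question (the result variable name is only for illustration):

# Pivot city_visited into one column per distinct city, count occurrences
# per person, replace missing person/city combinations with 0, and sort
# rows by persons_name.
result = (
    exp.groupBy("persons_name")
       .pivot("city_visited")
       .count()
       .fillna(0)
       .orderBy("persons_name")
)
result.show()

which should print the table from the question:

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2|    0|    1|    3|
+------------+-----+-----+-----+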