Counting how many times an array contains each string, per category, in PySpark
I start with a Spark DataFrame "df_spark". I want to end up with a new Spark DataFrame "df_results_spark" that counts, for each category ("red", "blue", "orange"), how many times each of the strings "cat", "monkey", and "dog" occurs in the arrays.
You can use the explode() function to create one row for each element of the array:
df_spark_exploded = df_spark.selectExpr("color","explode(animal) as animal")
df_spark_exploded.show()
+------+------+
| color|animal|
+------+------+
|  blue|   cat|
|  blue|   dog|
|orange|   cat|
|orange|monkey|
|  blue|monkey|
|  blue|   cat|
|orange|   dog|
|orange|monkey|
|orange|   cat|
|orange|   dog|
|   red|monkey|
|   red|   dog|
+------+------+
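To make the row-per-element behaviour concrete, here is a minimal plain-Python sketch of what explode() does to each (color, array) row. It does not use Spark. The input rows are an assumption: the original "df_spark" is not shown in the post, so they were chosen only to be consistent with the output above, and the helper name explode_rows is hypothetical.

```python
# Hypothetical input: (color, animal array) pairs. The real df_spark is not
# shown in the post; these rows are assumed so that they match the output above.
rows = [
    ("blue", ["cat", "dog"]),
    ("orange", ["cat", "monkey"]),
    ("blue", ["monkey", "cat"]),
    ("orange", ["dog", "monkey"]),
    ("orange", ["cat", "dog"]),
    ("red", ["monkey", "dog"]),
]


def explode_rows(rows):
    """Return one (color, animal) pair per array element, keeping row order."""
    return [(color, animal) for color, animals in rows for animal in animals]


exploded = explode_rows(rows)
for color, animal in exploded:
    print(color, animal)
```

As in this sketch, a row whose array is empty contributes no output rows. In Spark, explode() also drops rows whose array is null; explode_outer() keeps them.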
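Then use pivot() to reshape the DataFrame and apply the count aggregate function to get the number of occurrences of each animal per color. fillna(0) replaces the nulls that pivot() leaves for combinations that never occur (for example, red has no cat):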
df_results_spark = df_spark_exploded.groupby("color").pivot("animal").count().fillna(0)
df_results_spark.show()
+------+---+---+------+
| color|cat|dog|monkey|
+------+---+---+------+
|orange|  2|  2|     2|
|   red|  0|  1|     1|
|  blue|  2|  1|     1|
+------+---+---+------+
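The groupby/pivot/count step can be checked by hand with a small plain-Python sketch. It does not use Spark, and it is only an illustration of the logic, not how Spark implements it. The (color, animal) pairs are copied from the exploded output shown earlier; the helper name pivot_count is hypothetical.

```python
from collections import Counter

# (color, animal) pairs copied from the exploded output shown earlier.
exploded = [
    ("blue", "cat"), ("blue", "dog"),
    ("orange", "cat"), ("orange", "monkey"),
    ("blue", "monkey"), ("blue", "cat"),
    ("orange", "dog"), ("orange", "monkey"),
    ("orange", "cat"), ("orange", "dog"),
    ("red", "monkey"), ("red", "dog"),
]


def pivot_count(pairs):
    """Count each (color, animal) pair and lay the counts out as color -> {animal: count}."""
    counts = Counter(pairs)
    colors = sorted({color for color, _ in pairs})
    animals = sorted({animal for _, animal in pairs})
    # A Counter returns 0 for missing keys, which plays the role of fillna(0).
    return {color: {animal: counts[(color, animal)] for animal in animals}
            for color in colors}


result = pivot_count(exploded)
for color, row in result.items():
    print(color, row)
```

In PySpark itself, pivot() also accepts an explicit list of values, for example pivot("animal", ["cat", "dog", "monkey"]). Passing the list fixes the set and order of output columns and saves Spark the extra pass it otherwise needs to find the distinct values.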