Count how many times an array contains each string, per category, in PySpark


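I start with a Spark DataFrame, "df_spark", which has a "color" column and an array column "animal".

I want to end up with a new Spark table, "df_results_spark", that counts, for each category (red, blue, orange), how many times the strings "cat", "monkey", and "dog" occur in the arrays.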


You can use the explode() function to create one row for each element of the array:

df_spark_exploded = df_spark.selectExpr("color","explode(animal) as animal")
df_spark_exploded.show()

+------+------+
| color|animal|
+------+------+
|  blue|   cat|
|  blue|   dog|
|orange|   cat|
|orange|monkey|
|  blue|monkey|
|  blue|   cat|
|orange|   dog|
|orange|monkey|
|orange|   cat|
|orange|   dog|
|   red|monkey|
|   red|   dog|
+------+------+
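Then use pivot() to reshape the DataFrame and apply the count aggregation to get the count for each animal. fillna(0) replaces the nulls left for color/animal combinations that never occur with 0: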

df_results_spark = df_spark_exploded.groupby("color").pivot("animal").count().fillna(0)
df_results_spark.show()

+------+---+---+------+
| color|cat|dog|monkey|
+------+---+---+------+
|orange|  2|  2|     2|
|   red|  0|  1|     1|
|  blue|  2|  1|     1|
+------+---+---+------+
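
To make it clearer what the explode / pivot / count pipeline computes, here is a minimal plain-Python sketch of the same logic, with no Spark involved. The input rows are hypothetical: the original "df_spark" is not shown, so they were chosen only to be consistent with the exploded output above. The names rows, exploded, counts, and table are illustrative as well.

```python
from collections import Counter, defaultdict

# Hypothetical input rows (color, array of animals). The real df_spark is
# not shown in the question; these rows merely reproduce the exploded output.
rows = [
    ("blue", ["cat", "dog"]),
    ("orange", ["cat", "monkey"]),
    ("blue", ["monkey", "cat"]),
    ("orange", ["dog", "monkey"]),
    ("orange", ["cat", "dog"]),
    ("red", ["monkey", "dog"]),
]

# Step 1 (explode): emit one (color, animal) pair per array element.
exploded = [(color, animal) for color, animals in rows for animal in animals]

# Step 2 (groupby + pivot + count): tally each animal within each color.
counts = defaultdict(Counter)
for color, animal in exploded:
    counts[color][animal] += 1

# Step 3 (fillna(0)): every color gets a column for every animal;
# a Counter returns 0 for combinations that never occurred.
animals = sorted({animal for _, animal in exploded})
table = {color: {a: counts[color][a] for a in animals} for color in counts}

for color, row in table.items():
    print(color, row)
```

The per-color totals match the pivoted table above; only the row order may differ, since Spark does not guarantee an output order after groupby.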