Counting distinct sub-array elements in PySpark

I have the following JSON structure:

{ 
   "stuff": 1, "some_str": "srt", list_of_stuff": [
                  {"element_x":1, "element_y":"22x"}, 
                  {"element_x":3, "element_y":"23x"}
                ]
}, 
{ 
   "stuff": 2, "some_str": "srt2", "list_of_stuff": [
                  {"element_x":1, "element_y":"22x"}, 
                  {"element_x":4, "element_y":"24x"}
                ]
}, 
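(For reference, spark.read.json expects JSON Lines by default, i.e. one complete object per line, so on disk the sample above would be laid out like:)

{"stuff": 1, "some_str": "srt", "list_of_stuff": [{"element_x":1, "element_y":"22x"}, {"element_x":3, "element_y":"23x"}]}
{"stuff": 2, "some_str": "srt2", "list_of_stuff": [{"element_x":1, "element_y":"22x"}, {"element_x":4, "element_y":"24x"}]}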
When I read it into a PySpark DataFrame as JSON:

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Each record: an integer, a string, and an array of (int, string) structs.
schema = StructType([
    StructField("stuff", IntegerType()),
    StructField("some_str", StringType()),
    StructField("list_of_stuff", ArrayType(
        StructType([
            StructField("element_x", IntegerType()),
            StructField("element_y", StringType()),
        ])
    ))
])

df = spark.read.json("hdfs:///path/file.json/*", schema=schema)
df.show()
I get the following:

+--------+---------+-------------------+
| stuff  | some_str|    list_of_stuff  |
+--------+---------+-------------------+
|   1    |   srt   |  [1,22x], [3,23x] |
|   2    |   srt2  |  [1,22x], [4,24x] |
+--------+---------+-------------------+
It seems that PySpark flattens out the key names of the ArrayType, although I can still see them when I run df.printSchema():
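
(With the schema above, df.printSchema() shows:)

root
 |-- stuff: integer (nullable = true)
 |-- some_str: string (nullable = true)
 |-- list_of_stuff: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- element_x: integer (nullable = true)
 |    |    |-- element_y: string (nullable = true)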

Question: I need to count the distinct occurrences of element_y across the DataFrame. So, given the sample JSON, I would get the following output:

22x: 2, 23x: 1, 24x: 1


I can't figure out how to get inside the ArrayType and count the distinct values of the sub-element element_y. Any help is appreciated.

One approach is to flatten the array with rdd flatMap and then count:

df.rdd.flatMap(lambda r: [x.element_y for x in r['list_of_stuff']]).countByValue()
# defaultdict(<class 'int'>, {'24x': 1, '22x': 2, '23x': 1})
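
countByValue() brings the result back to the driver as a defaultdict; if you want it as a plain, sorted dict, a small usage sketch:

counts = df.rdd.flatMap(lambda r: [x.element_y for x in r['list_of_stuff']]).countByValue()
print(dict(sorted(counts.items())))
# {'22x': 2, '23x': 1, '24x': 1}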

Another option is to stay in the DataFrame API: explode the array, then group by element_y and count:
import pyspark.sql.functions as F
(df.select(F.explode(df.list_of_stuff).alias('stuff'))
   .groupBy(F.col('stuff').element_y.alias('key'))
   .count()
   .show())
+---+-----+
|key|count|
+---+-----+
|24x|    1|
|22x|    2|
|23x|    1|
+---+-----+
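
If you need the counts back on the driver in the {value: count} shape from the question, one way (a sketch) is to collect the grouped rows:

counts_df = (df.select(F.explode(df.list_of_stuff).alias('stuff'))
               .groupBy(F.col('stuff').element_y.alias('key'))
               .count())
result = {row['key']: row['count'] for row in counts_df.collect()}
# {'22x': 2, '23x': 1, '24x': 1}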