Pyspark: unable to drop the column used to explode an array


As a preface, I am using pyspark==2.4.5.

When reading in the json file, which can be found here: , I need to explode the data column, and after the explode I no longer need data, nor stats.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local[2]').appName("createDataframe") \
    .getOrCreate()
json_data = spark.read.option('multiline', True).json(file_name)
json_data = json_data.withColumn("data_values", F.explode_outer("data")) \
    .drop("data", "stats")
Below you will see the schema and the first 5 rows of json_data.

root
 |-- data_values: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- events: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: long (nullable = true)
 |    |    |    |-- index: long (nullable = true)
 |    |    |    |-- mode: long (nullable = true)
 |    |    |    |-- rate: long (nullable = true)
 |    |    |    |-- timestamp: string (nullable = true)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|data_values                                                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                     |
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]  |
|[2019-02-22, [[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]]  |
|[2019-02-23, [[1, 3, 1, 1, 2019-02-23T00:16:00]]]                                                                                                           |
|[2019-02-24, [[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]]                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
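Since data_values is a struct, its fields can be reached with dot notation (data_values.date, data_values.events), which is what the next query relies on.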
Now, to get the data I want, I run the query below.

newData = json_data \
    .withColumn("events", F.explode(json_data.data_values.events)) \
    .withColumn("date", json_data.data_values.date)
newData.printSchema()
newData.show(3)
finalData = newData.drop("data_values")
finalData.show(6)
Above you can see that I create a column called data_values, which holds the exploded incoming json data. I then create columns that pull events and date out of data_values. Below you will see what the schema looks like, as well as the first 5 rows.

root
 |-- data_values: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- events: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: long (nullable = true)
 |    |    |    |-- index: long (nullable = true)
 |    |    |    |-- mode: long (nullable = true)
 |    |    |    |-- rate: long (nullable = true)
 |    |    |    |-- timestamp: string (nullable = true)
 |-- events: struct (nullable = true)
 |    |-- active: long (nullable = true)
 |    |-- index: long (nullable = true)
 |    |-- mode: long (nullable = true)
 |    |-- rate: long (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |-- date: string (nullable = true)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
|data_values                                                                                                                                                  |events                           |date      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                     |[0, 0, 1, 0, 2019-02-20T00:00:00]|2019-02-20|
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                     |[0, 1, 1, 0, 2019-02-20T00:01:00]|2019-02-20|
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                     |[0, 2, 1, 0, 2019-02-20T00:02:00]|2019-02-20|
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]  |[1, 0, 1, 0, 2019-02-21T00:03:00]|2019-02-21|
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]  |[0, 1, 1, 0, 2019-02-21T00:04:00]|2019-02-21|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
Once I have the dataframe I want and try to drop data_values, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o58.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_25#25
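
The "Binding attribute" failure appears to come from how the columns were derived: events was built from json_data.data_values.events, a Column object bound to the plan that still contained data_values, so once that column is dropped the Spark 2.4 analyzer can no longer resolve the generator output alias (_gen_alias_25#25). As a minimal sketch of one way around it (not the only fix), the query below derives everything with string column references inside a single select chain, so no Column is tied to an earlier DataFrame; file_name is a placeholder for the path to the json file:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local[2]').appName('createDataframe').getOrCreate()

file_name = '/path/to/stack.json'  # placeholder, point this at your copy of the file

# One row per element of the outer data array, then one row per nested event.
# String references ('data_values.date', 'data_values.events') avoid holding
# on to Column objects from a plan that later drops the column.
final_data = (
    spark.read.option('multiline', True).json(file_name)
    .select(F.explode_outer('data').alias('data_values'))
    .select(
        F.col('data_values.date').alias('date'),
        F.explode('data_values.events').alias('events'),
    )
)
final_data.show(6)

The full session below walks through the original steps one at a time and drops both columns successfully: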
>>> from pyspark.sql import functions as F
>>> json_data = spark.read.option('multiline', True).json("/home/maheshpersonal/stack.json")
>>> json_data.show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]], [2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]], [2019-02-22, [[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]], [2019-02-23, [[1, 3, 1, 1, 2019-02-23T00:16:00]]], [2019-02-24, [[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>> json_data.printSchema()
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- events: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- active: long (nullable = true)
 |    |    |    |    |-- index: long (nullable = true)
 |    |    |    |    |-- mode: long (nullable = true)
 |    |    |    |    |-- rate: long (nullable = true)
 |    |    |    |    |-- timestamp: string (nullable = true)
>>> json_data_1 = json_data.withColumn("data_values", F.explode_outer("data"))
>>> json_data_1.printSchema()
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- events: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- active: long (nullable = true)
 |    |    |    |    |-- index: long (nullable = true)
 |    |    |    |    |-- mode: long (nullable = true)
 |    |    |    |    |-- rate: long (nullable = true)
 |    |    |    |    |-- timestamp: string (nullable = true)
 |-- data_values: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- events: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: long (nullable = true)
 |    |    |    |-- index: long (nullable = true)
 |    |    |    |-- mode: long (nullable = true)
 |    |    |    |-- rate: long (nullable = true)
 |    |    |    |-- timestamp: string (nullable = true)
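Note that in the next step events and date are derived from json_data_1.data_values, i.e. from the DataFrame that actually carries the exploded column, and the original data column is left in place until after they exist: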
>>> newData = json_data_1.withColumn("events", json_data_1.data_values.events).withColumn("date", json_data_1.data_values.date)
>>> newData.show()
+--------------------+--------------------+--------------------+----------+
|                data|         data_values|              events|      date|
+--------------------+--------------------+--------------------+----------+
|[[2019-02-20, [[0...|[2019-02-20, [[0,...|[[0, 0, 1, 0, 201...|2019-02-20|
|[[2019-02-20, [[0...|[2019-02-21, [[1,...|[[1, 0, 1, 0, 201...|2019-02-21|
|[[2019-02-20, [[0...|[2019-02-22, [[1,...|[[1, 0, 1, 0, 201...|2019-02-22|
|[[2019-02-20, [[0...|[2019-02-23, [[1,...|[[1, 3, 1, 1, 201...|2019-02-23|
|[[2019-02-20, [[0...|[2019-02-24, [[1,...|[[1, 0, 1, 1, 201...|2019-02-24|
+--------------------+--------------------+--------------------+----------+
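As an aside, drop accepts either a column name or a single Column object, so drop(newData.data) below behaves the same as drop("data"):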
>>> newData_v1 = newData.drop(newData.data)
>>> newData_v1.show()
+--------------------+--------------------+----------+
|         data_values|              events|      date|
+--------------------+--------------------+----------+
|[2019-02-20, [[0,...|[[0, 0, 1, 0, 201...|2019-02-20|
|[2019-02-21, [[1,...|[[1, 0, 1, 0, 201...|2019-02-21|
|[2019-02-22, [[1,...|[[1, 0, 1, 0, 201...|2019-02-22|
|[2019-02-23, [[1,...|[[1, 3, 1, 1, 201...|2019-02-23|
|[2019-02-24, [[1,...|[[1, 0, 1, 1, 201...|2019-02-24|
+--------------------+--------------------+----------+
>>> finalDataframe = newData_v1.drop(newData_v1.data_values)
>>> finalDataframe.show(truncate = False)
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
|events                                                                                                                                      |date      |
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
|[[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]                                   |2019-02-20|
|[[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]|2019-02-21|
|[[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]|2019-02-22|
|[[1, 3, 1, 1, 2019-02-23T00:16:00]]                                                                                                         |2019-02-23|
|[[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]                                                                      |2019-02-24|
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
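In contrast to the failing version, nothing above drops a column before the fields derived from it exist, and every reference to data_values goes through the DataFrame that owns it; that appears to be what lets the final drops of data and data_values succeed without the _gen_alias binding error.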