Shuffling the rows of a pivoted PySpark DataFrame throws a NullPointerException

Tags: pyspark, apache-spark-sql

Suppose the following PySpark DataFrame:

+-------+----+---+---+----+
|user_id|type| d1| d2|  d3|
+-------+----+---+---+----+
|     c1|   A|3.4|0.4| 3.5|
|     c1|   B|9.6|0.0| 0.0|
|     c1|   A|2.8|0.4| 0.3|
|     c1|   B|5.4|0.2|0.11|
|     c2|   A|0.0|9.7| 0.3|
|     c2|   B|9.6|8.6| 0.1|
|     c2|   A|7.3|9.1| 7.0|
|     c2|   B|0.7|6.4| 4.3|
+-------+----+---+---+----+
created with:

from pyspark.sql import functions as f  # `f` is used in the pivot/orderBy steps below

# `sc` is the SparkContext provided by the PySpark shell or notebook session
df = sc.parallelize([
    ("c1", "A", 3.4, 0.4, 3.5), 
    ("c1", "B", 9.6, 0.0, 0.0),
    ("c1", "A", 2.8, 0.4, 0.3),
    ("c1", "B", 5.4, 0.2, 0.11),
    ("c2", "A", 0.0, 9.7, 0.3), 
    ("c2", "B", 9.6, 8.6, 0.1),
    ("c2", "A", 7.3, 9.1, 7.0),
    ("c2", "B", 0.7, 6.4, 4.3)
]).toDF(["user_id", "type", "d1", "d2", "d3"])
df.show()
Then, I pivot it by user_id to obtain:

data_wide = df.groupBy('user_id')\
.pivot('type')\
.agg(*[f.sum(x).alias(x) for x in df.columns if x not in {"user_id", "type"}])

data_wide.show()

+-------+-----------------+------------------+----+------------------+----+------------------+
|user_id|             A_d1|              A_d2|A_d3|              B_d1|B_d2|              B_d3|
+-------+-----------------+------------------+----+------------------+----+------------------+
|     c1|6.199999999999999|               0.8| 3.8|              15.0| 0.2|              0.11|
|     c2|              7.3|18.799999999999997| 7.3|10.299999999999999|15.0|4.3999999999999995|
+-------+-----------------+------------------+----+------------------+----+------------------+
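As an aside, the pivot can also be written with the distinct values of type listed explicitly, which lets Spark skip the extra job it otherwise runs to discover them. This is only a sketch (it assumes A and B are the only types), and it is not confirmed whether it changes the planning behaviour described below:

# Same pivot, but with the pivot values given up front (assumes only A and B exist),
# so Spark does not have to compute the distinct values of `type` first.
data_wide = df.groupBy('user_id')\
    .pivot('type', ['A', 'B'])\
    .agg(*[f.sum(x).alias(x) for x in df.columns if x not in {"user_id", "type"}])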
Now, I want to randomize the order of its rows:

data_wide = data_wide.orderBy(f.rand())
data_wide.show()
But this last step throws a NullPointerException:

Py4JJavaError: An error occurred while calling o101.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 161 in stage 27.0 failed 20 times, most recent failure: Lost task 161.19 in stage 27.0 (TID 1300, 192.168.192.57, executor 1): java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_3$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:232) 
However, if the wide DataFrame is cached before the orderBy(f.rand()) step, this last step works fine:

data_wide.cache()
data_wide = data_wide.orderBy(f.rand())
data_wide.show()

+-------+-----------------+------------------+----+------------------+----+------------------+
|user_id|             A_d1|              A_d2|A_d3|              B_d1|B_d2|              B_d3|
+-------+-----------------+------------------+----+------------------+----+------------------+
|     c2|              7.3|18.799999999999997| 7.3|10.299999999999999|15.0|4.3999999999999995|
|     c1|6.199999999999999|               0.8| 3.8|              15.0| 0.2|              0.11|
+-------+-----------------+------------------+----+------------------+----+------------------+ 
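If caching the whole wide DataFrame is undesirable, another way to materialize it before the shuffle is checkpointing. This is only a workaround sketch, not a root-cause fix, and the checkpoint directory below is a placeholder:

# Materialize the pivoted result and truncate its lineage, so that
# orderBy(rand()) no longer re-plans through the pivot (workaround sketch).
sc.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder directory
data_wide = data_wide.checkpoint()             # eager by default, available since Spark 2.1
data_wide = data_wide.orderBy(f.rand())
data_wide.show()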
What is the problem here? It looks as if, without caching, the pivot is not taking effect at the orderBy step and the execution is not being planned correctly, but I don't know what the actual problem is. Any ideas?

The Spark version is 2.1.0 and the Python version is 3.5.2.


Thanks in advance.

Comments:

- This works for me on 2.3.1 without caching... What versions of Python and PySpark are you using?
- @coldspeed Sorry, the Spark version is 2.1.0 and the Python version is 3.5.2. It seems to be a version issue; I have edited the question, thank you.
- Would a simple upgrade fix it for you? It works for me on Spark 2.3.0.
- Yes! A simple upgrade fixed it, thanks.
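Since the resolution turned out to be a version mismatch, it can help to confirm which versions the session is actually running; a minimal check:

import sys

# Versions of the running Spark context and Python interpreter
print(sc.version)   # e.g. '2.1.0'
print(sys.version)  # e.g. '3.5.2 ...'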