PySpark SQL: Keep entries with only null values in a pivot table
I'm trying to create a pivot table on a PySpark SQL dataframe that does not drop the null values. My input table has the following structure (a small reproducible sample is given further down). I'm running everything in the IBM Data Science Experience cloud under Python 2 with Spark 2.1. When doing this on a pandas dataframe, the dropna=False parameter gives me the result I want:
table= pd.pivot_table(ratings,columns=['movieId'],index=[ 'monthyear','userId'], values='rating', dropna=False)
As output I get the following:
+---------+------+----+----+----+----+
|monthyear|UserID| 30| 32| 40| 45|
+---------+------+----+----+----+----+
| 201002| 2|null|null| 4.0|null|
| 200912| 2|null|null|null|null|
| 200002| 2|null|null|null|null|
| 200912| 1| 2.5| 3.0|null|null|
| 200002| 1|null|null|null|null|
| 201002| 1|null|null|null|null|
| 200002| 3|null|null|null| 2.5|
| 200912| 3|null|null|null|null|
| 201002| 3|null|null|null|null|
+---------+------+----+----+----+----+
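For completeness, here is a minimal self-contained sketch of that pandas call; the small ratings frame below is only a made-up stand-in for my real data, but it shows how dropna=False keeps the all-NaN combinations:
import pandas as pd

# hypothetical stand-in for the real ratings table
ratings = pd.DataFrame({
    "userId":    [1, 1, 2, 3],
    "movieId":   [30, 32, 40, 45],
    "rating":    [2.5, 3.0, 4.0, 2.5],
    "monthyear": [200912, 200912, 201002, 200002],
})

# with dropna=False the result is reindexed to the full cartesian product of the
# index levels, so (monthyear, userId) pairs without any rating stay as NaN rows
table = pd.pivot_table(ratings, columns=['movieId'], index=['monthyear', 'userId'],
                       values='rating', dropna=False)
print(table)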
In PySpark SQL I'm currently using the following command:
ratings_pivot = spark_df.groupBy('monthyear', 'userId').pivot('movieId').sum('rating')
ratings_pivot.show()
As output I get the following:
+---------+------+----+----+----+----+
|monthyear|UserID|  30|  32|  40|  45|
+---------+------+----+----+----+----+
|   201002|     2|null|null| 4.0|null|
|   200912|     1| 2.5| 3.0|null|null|
|   200002|     3|null|null|null| 2.5|
+---------+------+----+----+----+----+
As you can see, the entries that contain only null values are not shown. Is it possible to use something like dropna=False in SQL? Since this is very specific, I haven't been able to find anything about it on the internet.
I've just extracted a small dataset for reproduction:
df = spark.createDataFrame([("1", 30, 2.5,200912), ("1", 32, 3.0,200912), ("2", 40, 4.0,201002), ("3", 45, 2.5,200002)], ("userID", "movieID", "rating", "monthyear"))
df.show()
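For reference, df.show() on this sample should print the four input rows (this is also the structure of my original input table):
+------+-------+------+---------+
|userID|movieID|rating|monthyear|
+------+-------+------+---------+
|     1|     30|   2.5|   200912|
|     1|     32|   3.0|   200912|
|     2|     40|   4.0|   201002|
|     3|     45|   2.5|   200002|
+------+-------+------+---------+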
If I now run the pivot query, I get the following result:
df.groupBy("monthyear","UserID").pivot("movieID").sum("rating").show()
+---------+------+----+----+----+----+
|monthyear|UserID| 30| 32| 40| 45|
+---------+------+----+----+----+----+
| 201002| 2|null|null| 4.0|null|
| 200912| 1| 2.5| 3.0|null|null|
| 200002| 3|null|null|null| 2.5|
+---------+------+----+----+----+----+
What I would like instead is a result that looks like this:
+---------+------+----+----+----+----+
|monthyear|UserID| 30| 32| 40| 45|
+---------+------+----+----+----+----+
| 201002| 2|null|null| 4.0|null|
| 200912| 2|null|null|null|null|
| 200002| 2|null|null|null|null|
| 200912| 1| 2.5| 3.0|null|null|
| 200002| 1|null|null|null|null|
| 201002| 1|null|null|null|null|
| 200002| 3|null|null|null| 2.5|
| 200912| 3|null|null|null|null|
| 201002| 3|null|null|null|null|
+---------+------+----+----+----+----+
Spark does keep entries whose values are all null, for both rows and columns:
Spark 2.1:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Python version 3.6.4 (default, Dec 21 2017 21:42:08)
SparkSession available as 'spark'.
In [1]: df = spark.createDataFrame([("a", 1, 4), ("a", 2, 2), ("b", 3, None), (None, 4, None)], ("x", "y", "z"))
In [2]: df.groupBy("x").pivot("y").sum("z").show()
+----+----+----+----+----+
| x| 1| 2| 3| 4|
+----+----+----+----+----+
|null|null|null|null|null|
| b|null|null|null|null|
| a| 4| 2|null|null|
+----+----+----+----+----+
Spark 2.2:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Python version 3.6.4 (default, Dec 21 2017 21:42:08)
SparkSession available as 'spark'.
In [1]: df = spark.createDataFrame([("a", 1, 4), ("a", 2, 2), ("b", 3, None), (None, 4, None)], ("x", "y", "z"))
In [2]: df.groupBy("x").pivot("y").sum("z").show()
+----+----+----+----+----+
| x| 1| 2| 3| 4|
+----+----+----+----+----+
|null|null|null|null|null|
| b|null|null|null|null|
| a| 4| 2|null|null|
+----+----+----+----+----+
Spark 2.3:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
Using Python version 3.6.4 (default, Dec 21 2017 21:42:08)
SparkSession available as 'spark'.
In [1]: df = spark.createDataFrame([("a", 1, 4), ("a", 2, 2), ("b", 3, None), (None, 4, None)], ("x", "y", "z"))
In [2]: df.groupBy("x").pivot("y").sum("z").show()
+----+----+----+----+----+
| x| 1| 2| 3| 4|
+----+----+----+----+----+
|null|null|null|null|null|
| b|null|null|null|null|
| a| 4| 2|null|null|
+----+----+----+----+----+
Spark doesn't provide anything like dropna=False, simply because it cannot scale; pivot on its own is already expensive enough. You can, however, do this manually with an outer join:
n = 20  # adjust the value to fit your data

wide = (df
    # get the distinct monthyear values
    .select("monthyear")
    .distinct()
    .coalesce(n)  # coalesce to avoid "exploding" the number of partitions
    # do the same with UserID and take the Cartesian product
    .crossJoin(df.select("UserID").distinct().coalesce(n))
    # join with the pivoted data
    .join(
        df.groupBy("monthyear", "UserID")
            .pivot("movieID")
            .sum("rating"),
        ["monthyear", "UserID"],
        "left_outer"))

wide.show()
# +---------+------+----+----+----+----+
# |monthyear|UserID|  30|  32|  40|  45|
# +---------+------+----+----+----+----+
# |   201002|     3|null|null|null|null|
# |   201002|     2|null|null| 4.0|null|
# |   200002|     1|null|null|null|null|
# |   200912|     1| 2.5| 3.0|null|null|
# |   200002|     3|null|null|null| 2.5|
# |   200912|     2|null|null|null|null|
# |   200912|     3|null|null|null|null|
# |   201002|     1|null|null|null|null|
# |   200002|     2|null|null|null|null|
# +---------+------+----+----+----+----+
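For the SQL-minded, roughly the same fill-in can be sketched through Spark SQL as well. Note that the temporary view names and the hard-coded movie-ID columns (30, 32, 40, 45) below are assumptions taken from the sample data, not part of the original workaround:
# register the input and the pivoted data as temporary views (hypothetical names)
df.createOrReplaceTempView("ratings")
df.groupBy("monthyear", "UserID").pivot("movieID").sum("rating") \
    .createOrReplaceTempView("pivoted")

# build the cartesian product of months and users, then left-join the pivot
wide_sql = spark.sql("""
    SELECT m.monthyear, u.UserID, p.`30`, p.`32`, p.`40`, p.`45`
    FROM (SELECT DISTINCT monthyear FROM ratings) m
    CROSS JOIN (SELECT DISTINCT UserID FROM ratings) u
    LEFT JOIN pivoted p
        ON p.monthyear = m.monthyear AND p.UserID = u.UserID
""")
wide_sql.show()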
Thanks for the hint, I edited my question to make it reproducible.
Thank you very much for your answer. I'm using Spark version 2.1; I edited the last part of my question a bit to make it clearer.
Works perfectly, thanks! That's exactly what I was looking for.