Sorting a PySpark dataframe by multiple columns simultaneously


I have a JSON file with some data, and I convert this JSON into a PySpark dataframe (I selected some of the columns, not all of them). This is my code:

import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
import json
from pyspark.sql.functions import col

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df = spark.read.json("/Users/deemaalomair/PycharmProjects/first/deema.json") \
    .select('full_text', 'retweet_count', 'favorite_count')
c = df.count()
print(c)
df.orderBy(["retweet_count", "favorite_count"], ascending=[0, 0]).show(10)
This is the output:

                +--------------------+-------------+--------------+
                |           full_text|retweet_count|favorite_count|
                +--------------------+-------------+--------------+
                |Check out this in...|          388|           785|
                |Review – Apple Ai...|          337|           410|
                |This #iPhone atta...|          159|           243|
                |March is #Nationa...|          103|           133|
                +--------------------+-------------+--------------+

If you are trying to see descending values in two columns simultaneously, that is not going to happen, as each column has its own separate order.

In the data frame above you can see that retweet_count and favorite_count each follow their own order. This is the case with your data.
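The tie-breaking behaviour is the same as plain Python's `sorted` with a tuple key, which sorts lexicographically. A local sketch with made-up rows (mirroring the CSV example below, not the asker's data):

```python
# Hypothetical rows: (full_text, retweet_count, favorite_count)
rows = [("abc", 45, 45), ("def", 50, 40), ("ghi", 50, 39), ("jkl", 50, 41)]

# A tuple key sorts lexicographically: the second value only breaks
# ties on the first, just like orderBy on two columns.
by_two_keys = sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)
print([r[0] for r in by_two_keys])  # ['jkl', 'def', 'ghi', 'abc']
```

Note that `abc` sorts last even though its favorite_count (45) beats `def`'s and `ghi`'s, because the first key decides first.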

>>> import os
>>> from pyspark import SparkContext
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.sql import SparkSession
>>> sc = SparkContext.getOrCreate()
>>> spark = SparkSession(sc)
>>> df = spark.read.format('csv').option("header","true").load("/home/samba693/test.csv")
>>> df.show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
|      abc|           45|            45|
|      def|           50|            40|
|      ghi|           50|            39|
|      jkl|           50|            41|
+---------+-------------+--------------+


>>> df.orderBy(["retweet_count", "favorite_count"], ascending=False).show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
|      jkl|           50|            41|
|      def|           50|            40|
|      ghi|           50|            39|
|      abc|           45|            45|
+---------+-------------+--------------+

When we apply orderBy on two columns, what actually happens is that it sorts on the first column and, where there are ties, considers the value of the second column. But that may not be what you are looking for. What you seem to want is to sort them based on the sum of the two columns:

>>> from pyspark.sql.functions import expr
>>> df1 = df.withColumn('total',expr("retweet_count+favorite_count"))
>>> df1.show()
+---------+-------------+--------------+-----+
|full_text|retweet_count|favorite_count|total|
+---------+-------------+--------------+-----+
|      abc|           45|            45| 90.0|
|      def|           50|            40| 90.0|
|      ghi|           50|            39| 89.0|
|      jkl|           50|            41| 91.0|
+---------+-------------+--------------+-----+
One way to solve this is to add a new column holding the sum, apply orderBy on that new column, and drop it after sorting:

>>> df2 = df1.orderBy("total", ascending=False)
>>> df2.show()
+---------+-------------+--------------+-----+
|full_text|retweet_count|favorite_count|total|
+---------+-------------+--------------+-----+
|      jkl|           50|            41| 91.0|
|      abc|           45|            45| 90.0|
|      def|           50|            40| 90.0|
|      ghi|           50|            39| 89.0|
+---------+-------------+--------------+-----+
>>> df = df2.select("full_text","retweet_count","favorite_count")
>>> df.show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
|      jkl|           50|            41|
|      abc|           45|            45|
|      def|           50|            40|
|      ghi|           50|            39|
+---------+-------------+--------------+

This sorts by using a new column and drops it afterwards.


Hope this helps!

Possible duplicate. @DeemahAlomair, are you looking for a pure descending order in both columns? Then you need to find a common attribute such as the sum, as I did above! If you are trying to get the descending order of one column, you can do that! But I am more interested in why you want descending order in both columns simultaneously, since there may be other solutions to this question!! Do you know how to print the full text? @SambaSivaRao