How to get percentage frequencies in pyspark


I am trying to get percentage frequencies in pyspark. In Python (pandas) I do it as follows:

Companies = df['Company'].value_counts(normalize = True)
Getting the raw counts is fairly straightforward:

# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
                    FROM Comp \
                    GROUP BY Company \
                    ORDER BY cnt DESC")
CompDF.show()
How do I get the percentage frequencies from here? I have tried a number of things without much luck.
Any help would be greatly appreciated.

Maybe modifying the SQL query like this will give you the result you want:

    "SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt 
    FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from 
    (SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt 
    DESC)"

As Suresh hinted in the comments, assuming total_count is the number of rows in the dataframe Companies, you can use withColumn to add a new column named percentage to CompDF:

total_count = Companies.count()  # total number of rows

df = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
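
For completeness, a minimal end-to-end sketch using only the DataFrame API, without the temp view (assuming df is the source dataframe with a Company column, as in the question; total and freq are just illustrative names):

from pyspark.sql import functions as F

total = df.count()  # total number of rows in the source dataframe
freq = (df.groupBy('Company')
          .count()
          .withColumn('percentage', F.col('count') / total)
          .orderBy(F.desc('count')))
freq.show()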

How about using the total count to compute the percentage? That looks very concise and clear. Thanks.