Percentile.INC in PySpark

I need to replicate Excel's PERCENTILE.INC function in PySpark. Below is my input dataframe (expect this to be a very large dataset).

I need to compute interpolated values for all percentile stops from 1 to 99 over the dataset above. Expected result (sample from 1 to 10).

I was able to replicate the results in Python using NumPy:

import numpy as np

# 1D array
arr = [2.4, 3.17, 4.25]
print("arr : ", arr)

# print("1st percentile of arr : ", np.percentile(arr, 1))
# print("25th percentile of arr : ", np.percentile(arr, 25))
# print("75th percentile of arr : ", np.percentile(arr, 75))

for i in range(1, 100):  # stops 1..99 (range(1, 99) would stop at 98)
    print(i, "percentile of arr : ", np.percentile(arr, i))
I can't work out how to compute the same values with PySpark.
Thanks in advance for any help.
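For reference, the interpolation that both PERCENTILE.INC and NumPy's default `np.percentile` method perform can be written out in a few lines of plain Python. The function name `percentile_inc` is just an illustrative choice:

```python
def percentile_inc(values, p):
    """Interpolated percentile, p in [0, 100], inclusive endpoints.

    Rank 0 maps to the minimum and rank (n - 1) to the maximum, which
    is the PERCENTILE.INC / np.percentile "linear" convention.
    """
    xs = sorted(values)
    rank = (p / 100.0) * (len(xs) - 1)   # fractional rank into sorted data
    lo = int(rank)                       # index just below the target rank
    frac = rank - lo                     # distance between the two neighbours
    if lo + 1 >= len(xs):
        return xs[-1]
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

arr = [2.4, 3.17, 4.25]
print(percentile_inc(arr, 1))    # ~2.4154, matching np.percentile(arr, 1)
print(percentile_inc(arr, 50))   # 3.17, the middle value
```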

Check whether this helps:

Load the test data
val df = Seq(("F1", "I1", 2.4), ("F2", "I1", 3.17), ("F3", "I1", 4.25))
  .toDF("FacilityKey", "ItemKey", "ItemValue")
df.show(false)
df.printSchema()
/**
  * +-----------+-------+---------+
  * |FacilityKey|ItemKey|ItemValue|
  * +-----------+-------+---------+
  * |F1         |I1     |2.4      |
  * |F2         |I1     |3.17     |
  * |F3         |I1     |4.25     |
  * +-----------+-------+---------+
  *
  * root
  *  |-- FacilityKey: string (nullable = true)
  *  |-- ItemKey: string (nullable = true)
  *  |-- ItemValue: double (nullable = false)
  */
Compute the percentiles for all stops from 1 to 99

df.groupBy("ItemKey")
  .agg(
    expr(s"percentile(ItemValue, array(${Range(1, 100).map(_ * 0.01).mkString(", ")}))")
      .as("percentiles"))
  .withColumn("percentile", explode($"percentiles"))
  .show(false)
/**
  * +-------+------------------+
  * |ItemKey|percentile        |
  * +-------+------------------+
  * |I1     |2.4154            |
  * |I1     |2.4307999999999996|
  * |I1     |2.4461999999999997|
  * |I1     |2.4616000000000002|
  * |I1     |2.4770000000000003|
  * |I1     |2.4924            |
  * |I1     |2.5078            |
  * |I1     |2.5232            |
  * |I1     |2.5385999999999997|
  * |I1     |2.554             |
  * |I1     |2.5694            |
  * |I1     |2.5847999999999995|
  * |I1     |2.6002            |
  * |I1     |2.6156            |
  * |I1     |2.631             |
  * |I1     |2.6464            |
  * |I1     |2.6618            |
  * |I1     |2.6772            |
  * |I1     |2.6925999999999997|
  * |I1     |2.708             |
  * +-------+------------------+
  * only showing top 20 rows
  */
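Since the question asks for PySpark, the same aggregation can be sketched in Python. Following the suggestion in the comments, the list of stops can be built with Python's `range` and spliced into a `percentile` SQL expression; the column names follow the sample data above, and the SparkSession setup is only sketched in comments because it needs a running Spark installation:

```python
# Build the 99 percentile stops (0.01 .. 0.99) in plain Python and
# splice them into a Spark SQL `percentile` expression string.
stops = [i / 100.0 for i in range(1, 100)]
percentile_expr = "percentile(ItemValue, array({}))".format(
    ", ".join(str(s) for s in stops))

# With an active SparkSession (sketch, requires pyspark):
# from pyspark.sql import SparkSession, functions as F
# spark = SparkSession.builder.getOrCreate()
# df = spark.createDataFrame(
#     [("F1", "I1", 2.4), ("F2", "I1", 3.17), ("F3", "I1", 4.25)],
#     ["FacilityKey", "ItemKey", "ItemValue"])
# (df.groupBy("ItemKey")
#    .agg(F.expr(percentile_expr).alias("percentiles"))
#    .withColumn("percentile", F.explode("percentiles"))
#    .show(truncate=False))
```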
Execution plan
df.groupBy("ItemKey")
  .agg(
    expr(s"percentile(ItemValue, array(${Range(1, 100).map(_ * 0.01).mkString(", ")}))")
      .as("percentiles"))
  .withColumn("percentile", explode($"percentiles"))
  .explain()
/**
  * == Physical Plan ==
  * Generate explode(percentiles#58), [ItemKey#8], false, [percentile#67]
  * +- ObjectHashAggregate(keys=[ItemKey#8], functions=[percentile(ItemValue#9, [0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.2,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.3,0.31,0.32,0.33,0.34,0.35000000000000003,0.36,0.37,0.38,0.39,0.4,0.41000000000000003,0.42,0.43,0.44,0.45,0.46,0.47000000000000003,0.48,0.49,0.5,0.51,0.52,0.53,0.54,0.55,0.56,0.5700000000000001,0.58,0.59,0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.6900000000000001,0.7000000000000001,0.71,0.72,0.73,0.74,0.75,0.76,0.77,0.78,0.79,0.8,0.81,0.8200000000000001,0.8300000000000001,0.84,0.85,0.86,0.87,0.88,0.89,0.9,0.91,0.92,0.93,0.9400000000000001,0.9500000000000001,0.96,0.97,0.98,0.99], 1, 0, 0)])
  *    +- Exchange hashpartitioning(ItemKey#8, 2)
  *       +- ObjectHashAggregate(keys=[ItemKey#8], functions=[partial_percentile(ItemValue#9, [0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.2,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.3,0.31,0.32,0.33,0.34,0.35000000000000003,0.36,0.37,0.38,0.39,0.4,0.41000000000000003,0.42,0.43,0.44,0.45,0.46,0.47000000000000003,0.48,0.49,0.5,0.51,0.52,0.53,0.54,0.55,0.56,0.5700000000000001,0.58,0.59,0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.6900000000000001,0.7000000000000001,0.71,0.72,0.73,0.74,0.75,0.76,0.77,0.78,0.79,0.8,0.81,0.8200000000000001,0.8300000000000001,0.84,0.85,0.86,0.87,0.88,0.89,0.9,0.91,0.92,0.93,0.9400000000000001,0.9500000000000001,0.96,0.97,0.98,0.99], 1, 0, 0)])
  *          +- LocalTableScan [ItemKey#8, ItemValue#9]
  */
For faster execution, you may want to consider `approx_percentile`.
df.groupBy("ItemKey")
  .agg(
    expr(s"approx_percentile(ItemValue, array(${Range(1, 100).map(_ * 0.01).mkString(", ")}))")
      .as("percentiles"))
  .withColumn("percentile", explode($"percentiles"))
  .show(false)
/**
  * +-------+----------+
  * |ItemKey|percentile|
  * +-------+----------+
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * |I1     |2.4       |
  * +-------+----------+
  * only showing top 20 rows
  */
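The constant 2.4 in the first rows is expected: unlike `percentile`, `approx_percentile` returns actual values drawn from the data instead of interpolating between them, so with only three rows every low stop resolves to the smallest value. A rough pure-Python illustration of that nearest-value behaviour (the function name is illustrative, and Spark's real implementation uses an approximate quantile summary, not this loop):

```python
# Nearest-value percentile with no interpolation: pick the element
# whose position in the sorted data covers the requested fraction.
# This mimics the *effect* seen above, not Spark's actual algorithm.
def nearest_value_percentile(values, p):
    xs = sorted(values)
    idx = min(int(p * len(xs)), len(xs) - 1)
    return xs[idx]

arr = [2.4, 3.17, 4.25]
print([nearest_value_percentile(arr, i / 100.0) for i in (1, 20, 50, 99)])
# -> [2.4, 2.4, 3.17, 4.25]: every stop below 1/3 resolves to 2.4
```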
Let me know if you run into any issues.


Not sure whether this is the right way: ``df = input_df.selectExpr('percentile(ItemValue, 0.01)').show()``; this gives the value for percentile 1. That works in Scala. Any suggestions on how to write it in PySpark?

Use Python `range` to create a string like `1,2,3…99` and pass it into the `array` as a variable.

Thanks. I was able to make it work with PySpark. Appreciate the help.
| PercentileRank | Item | PerformanceScore |
|----------------|------|------------------|
| 1              | I1   | 2.4154           |
| 2              | I1   | 2.4308           |
| 3              | I1   | 2.4462           |
| 4              | I1   | 2.4616           |
| 5              | I1   | 2.477            |
| 6              | I1   | 2.4924           |
| 7              | I1   | 2.5078           |
| 8              | I1   | 2.5232           |
| 9              | I1   | 2.5386           |
| 10             | I1   | 2.554            |
| 11             | I1   | 2.5694           |