Python: weighted median on a PySpark DataFrame

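I wrote the code below to compute a row-wise weighted median, but the resulting values are all null. Where did I go wrong? col_A holds the values and col_B holds the weights associated with those values.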

Code:

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def get_median(values,weights):
    return np.median(np.repeat(values,weights))    # function created to calculate wt. median

wimedian = F.udf(get_median,DoubleType())    # registering as udf here

myview = df.groupBy('category').agg(
    F.collect_list(F.col('col_A')),
    F.collect_list(F.col('col_B'))
).withColumn('Weighted_median',wimedian(F.col('col_A'),F.col('col_B')))

myview.show(3)
Output table:

+-----------+--------+-------+---------------+
|category   |col_A   |col_B  |Weighted_median|
+-----------+--------+-------+---------------+
|001        |[69]    |[8]    |null           |
|002        |[69]    |[14]   |null           |
|003        |[28, 21]|[3, 1] |null           |
+-----------+--------+-------+---------------+
For reference, the correct output for row 3 of this table should be the median of [28, 28, 28, 21], which is 28.

That is why np.median and np.repeat are used.
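The expected value for row 3 can be checked with NumPy alone, outside Spark. This is only a small sketch using the numbers from the table; the variable names are illustrative and not part of the original post.

```python
import numpy as np

values = [28, 21]    # col_A for category 003
weights = [3, 1]     # col_B for category 003

# np.repeat repeats each value as many times as its weight says
expanded = np.repeat(values, weights)    # array([28, 28, 28, 21])

# the plain median of the expanded array is the weighted median
weighted_median = np.median(expanded)    # 28.0
print(expanded.tolist(), weighted_median)
```

Note that np.repeat needs integer repeat counts, so this approach assumes the weights are non-negative integers.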

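The problem appears to be the return type, because the DataFrame does not understand NumPy types. In addition, the column references in the withColumn statement are incorrect: after agg() without aliases, the columns are named collect_list(col_A) and collect_list(col_B), not col_A and col_B.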

I cast the result to float, and it now works:

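import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def get_median(values, weights):
    # np.median returns a NumPy scalar; convert it to a plain Python float for DoubleType
    return float(np.median(np.repeat(values, weights)))

wimedian = F.udf(get_median, DoubleType())
df = sc.parallelize([["001",69,8],["002",69,14],["003",28,3],["003",21,1]]).toDF(["category","col_A","col_B"])

# reference the aggregated columns by the names agg() generates for them
myview = df.groupBy('category').agg(
    F.collect_list(F.col('col_A')),
    F.collect_list(F.col('col_B'))
).withColumn('Weighted_median', wimedian(F.col("collect_list(col_A)"), F.col("collect_list(col_B)")))

myview.show()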
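The return-type point can be seen without Spark. A minimal sketch, reusing the row 3 numbers from the question: np.median hands back a NumPy scalar, and float() turns it into the built-in Python type that a DoubleType UDF is expected to return.

```python
import numpy as np

raw = np.median(np.repeat([28, 21], [3, 1]))
print(type(raw).__name__)          # float64 (a NumPy scalar type)

converted = float(raw)
print(type(converted).__name__)    # float (the built-in Python type)
```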

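Could you provide the script that creates df? JSON would also be fine.

It seems to solve the problem for now. However, if I replace np.median with np.percentile (q=50), it fails again. It seems the return type is not the only problem here.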
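The failing np.percentile code is not shown in the thread, so the cause of that failure is unknown. As a hedged sketch only, the function below (get_percentile is a hypothetical name, not from the original posts) shows one way to write the percentile variant so that it returns a plain Python float: the array is passed first, q is a scalar, and the result is converted with float().

```python
import numpy as np

def get_percentile(values, weights, q=50):
    # expand the values by their integer weights, as in get_median
    expanded = np.repeat(values, weights)
    # with a scalar q, np.percentile returns a NumPy scalar; convert it to a Python float
    return float(np.percentile(expanded, q))

result = get_percentile([28, 21], [3, 1])
print(result)
```

If q is passed as a list such as [50], np.percentile returns an array rather than a scalar, which is one thing worth checking when the conversion misbehaves.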