Calculating a Weighted Average in PySpark


I am trying to compute a weighted average in PySpark, but I am not making much progress:

# Example data
df = sc.parallelize([
    ("a", 7, 1), ("a", 5, 2), ("a", 4, 3),
    ("b", 2, 2), ("b", 5, 4), ("c", 1, -1)
]).toDF(["k", "v1", "v2"])
df.show()
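For reference, df.show() should print something like:

+---+---+---+
|  k| v1| v2|
+---+---+---+
|  a|  7|  1|
|  a|  5|  2|
|  a|  4|  3|
|  b|  2|  2|
|  b|  5|  4|
|  c|  1| -1|
+---+---+---+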

import numpy as np
import pyspark.sql.functions
import pyspark.sql.types

def weighted_mean(workclass, final_weight):
    return np.average(workclass, weights=final_weight)

weighted_mean_udaf = pyspark.sql.functions.udf(weighted_mean,
    pyspark.sql.types.IntegerType())
But when I try to execute this code:

df.groupby('k').agg(weighted_mean_udaf(df.v1, df.v2)).show()
I get the error:

u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get
My question is: can I pass a custom function (taking multiple arguments) as an argument to agg? If not, is there any other way to perform an operation like a weighted mean after grouping by key?

User-defined aggregate functions (UDAFs, which work on pyspark.sql.GroupedData and are not supported in PySpark) are not the same thing as user-defined functions (UDFs, which work on a pyspark.sql.DataFrame).

Because you cannot create your own UDAF in PySpark, and the supplied UDAFs will not solve your problem, you may need to go back to the RDD world:

def weighted_mean(vals):
    vals = list(vals)  # materialize the iterator so it can be traversed twice
    sum_of_weights = sum(tup[1] for tup in vals)  # the built-in sum handles generators
    return sum(1. * tup[0] * tup[1] / sum_of_weights for tup in vals)

df.rdd.map(
    lambda x: (x[0], tuple(x[1:]))  # reshape each Row to (key, values) so grouping works
).groupByKey().mapValues(
    weighted_mean
).collect()
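Incidentally, for a weighted mean in particular you do not need a UDAF at all: the whole aggregation can be expressed with built-in column functions. A minimal sketch that stays in the DataFrame API (equivalent to the RDD version above):

from pyspark.sql import functions as F

df.groupBy('k').agg(
    (F.sum(F.col('v1') * F.col('v2')) / F.sum('v2')).alias('weighted_mean')
).show()

For the example data this gives 29/6 ≈ 4.83 for key a, 4.0 for b, and 1.0 for c, matching the RDD result.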

Did you mean to overwrite the weighted_mean function?

What I want to do is a) group by, and b) perform an operation that uses multiple columns of the DataFrame. The weighted mean is just an example.

I think what @cricket_007 means is: did you deliberately overwrite weighted_mean with the line weighted_mean = pyspark.sql.functions.udf(weighted_mean, ...), or is it a typo? I don't think the function takes the argument types you are giving it.

@cricket_007 is right. agg only accepts proper UDAFs (there is an example, although no Python API). For a small case like this, all you need is a simple formula, so it looks like a rather artificial question.

Thanks @ijoseph for pointing out that map works on df.rdd. At the time of writing I was used to calling df.map directly. I am not sure whether that still works, but it is better to be explicit.

I believe that without .rdd it throws an error for me (in Spark 2.4.3).
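A follow-up on the "no Python API" point above: since Spark 2.4, grouped aggregate pandas UDFs do provide UDAF-like behaviour in PySpark, so agg can take a custom multi-column function directly. A minimal sketch, assuming Spark 2.4+ with pyarrow installed (weighted_mean_pandas is just an illustrative name):

import numpy as np
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType(), PandasUDFType.GROUPED_AGG)
def weighted_mean_pandas(v, w):
    # v and w arrive as pandas Series holding one group's values
    return np.average(v, weights=w)

df.groupby('k').agg(weighted_mean_pandas(df.v1, df.v2)).show()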