Python: multiplying two pyspark dataframe columns of different types (array[double] vs double) without using breeze


I have the same question that has been asked before, but I need a solution in pyspark and without breeze.

For example, if my pyspark dataframe looks like this:

user    |  weight  |  vec
"u1"    | 0.1      | [2, 4, 6]
"u1"    | 0.5      | [4, 8, 12]
"u2"    | 0.5      | [20, 40, 60]
where the column weight has type double and the column vec has type Array[double], I would like to compute the weighted sum of the vectors per user, so that I get a dataframe that looks like this:

user    |  wsum
"u1"    | [2.2, 4.4, 6.6]
"u2"    | [10, 20, 30]
To do this, I tried the following:

df.groupBy('user').agg((F.sum(df.vec* df.weight)).alias("wsum"))
But it failed because the vec column and the weight column have different types.


How can I solve this error without using breeze?

Use the higher-order function transform, available since Spark 2.4:

# imports used below (sum here is pyspark.sql.functions.sum, not the builtin)
from pyspark.sql.functions import array, col, expr, size, sum

# get size of vec array
n = df.select(size("vec")).first()[0]

# multiply each element of the vec array by the row's weight
transform_expr = "transform(vec, x -> x * weight)"

df.withColumn("weighted_vec", expr(transform_expr)) \
  .groupBy("user").agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
  .show()
This gives:

+----+------------------+
|user|              wsum|
+----+------------------+
|  u1|   [2.2, 4.4, 6.6]|
|  u2|[10.0, 20.0, 30.0]|
+----+------------------+
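On Spark 3.1 and later (an assumption beyond what this answer covers), the same idea can be written with the Python transform function instead of a SQL expression string; a minimal sketch:

from pyspark.sql.functions import array, col, sum, transform

# transform is the Python API for the same higher-order function (Spark 3.1+)
weighted = df.withColumn("weighted_vec", transform("vec", lambda x: x * col("weight")))
weighted.groupBy("user") \
    .agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
    .show()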
For Spark < 2.4, use a list comprehension to multiply each element of vec by the weight column, as shown below:

# n and the imports are the same as in the Spark 2.4 snippet above
df.withColumn("weighted_vec", array(*[col("vec")[i] * col("weight") for i in range(n)])) \
  .groupBy("user").agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
  .show()

Thanks for the prompt answer. It seems to do something, but I am currently running out of memory with a java.lang.OutOfMemoryError. Not sure, but it might be related to the serialization issue discussed in the first answer.
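If the per-index aggregation above is what runs out of memory, one possible alternative (a sketch not from the original thread, assuming Spark 2.1+ for posexplode) is to explode the array, compute the weighted sum per position, and reassemble the result; it avoids building one column per array index and does not need n up front:

from pyspark.sql import functions as F

# explode vec together with its position, compute the weighted sum per (user, pos),
# then collect the sums back into an array ordered by position
wsum_df = (
    df.select("user", "weight", F.posexplode("vec").alias("pos", "val"))
      .groupBy("user", "pos")
      .agg(F.sum(F.col("val") * F.col("weight")).alias("wval"))
      .groupBy("user")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "wval"))).alias("tmp"))
      .select("user", F.col("tmp.wval").alias("wsum"))
)
wsum_df.show()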