Pandas to PySpark cumprod function
I am trying to convert the pandas code below to PySpark:
df = spark.createDataFrame([(1, 1,0.9), (1, 2,0.13), (1, 3,0.5), (1, 4,1.0), (1, 5,0.6)], ['col1', 'col2','col3'])
pandas_df = df.toPandas()
pandas_df['col4'] = (pandas_df.groupby(['col1','col2'])['col3'].apply(lambda x: (1 - x).cumprod()))
pandas_df
which produces the following result:
   col1  col2  col3  col4
0     1     1  0.90  0.10
1     1     2  0.13  0.87
2     1     3  0.50  0.50
3     1     4  1.00  0.00
4     1     5  0.60  0.40
And the converted Spark code:
from pyspark.sql import functions as F, Window, types
from functools import reduce
from operator import mul
df = spark.createDataFrame([(1, 1,0.9), (1, 2,0.13), (1, 3,0.5), (1, 4,1.0), (1, 5,0.6)], ['col1', 'col2','col3'])
partition_column = ['col1','col2']
window = Window.partitionBy(partition_column)
expr = 1.0 - F.col('col3')
mul_udf = F.udf(lambda x: reduce(mul, x), types.DoubleType())
df = df.withColumn('col4', mul_udf(F.collect_list(expr).over(window)))
df.orderBy('col2').show()
And its output:
+----+----+----+-------------------+
|col1|col2|col3|               col4|
+----+----+----+-------------------+
|   1|   1| 0.9|0.09999999999999998|
|   1|   2|0.13|               0.87|
|   1|   3| 0.5|                0.5|
|   1|   4| 1.0|                0.0|
|   1|   5| 0.6|                0.4|
+----+----+----+-------------------+
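I don't fully understand how pandas works here. Can someone help me verify that the conversion above is correct? I am also using a UDF, which hurts performance. Is there a distributed function in PySpark that can perform cumprod()?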
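On the verification question, a plain-Python sketch (not PySpark, and the helper names `cumprod_per_group` and `total_prod_per_group` are hypothetical) may help show why the two versions agree on this sample. The pandas code takes a running product within each group, while the Spark code multiplies the whole partition. Because the grouping key is (col1, col2) and every key occurs exactly once in the sample data, both reduce to 1 - col3. They would diverge if the grouping were on col1 alone.

```python
from itertools import accumulate, groupby
from math import prod
from operator import mul

# Same sample rows as in the question: (col1, col2, col3)
rows = [(1, 1, 0.9), (1, 2, 0.13), (1, 3, 0.5), (1, 4, 1.0), (1, 5, 0.6)]

def cumprod_per_group(rows, key):
    """Running product of (1 - col3) inside each group (what cumprod does)."""
    out = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        grp = list(grp)
        running = accumulate((1.0 - r[2] for r in grp), mul)
        out.extend((*r, p) for r, p in zip(grp, running))
    return out

def total_prod_per_group(rows, key):
    """Product of (1 - col3) over the whole group, repeated on every row
    (what collect_list over an unordered window plus reduce(mul) does)."""
    out = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        grp = list(grp)
        total = prod(1.0 - r[2] for r in grp)
        out.extend((*r, total) for r in grp)
    return out

by_col1_col2 = lambda r: (r[0], r[1])  # grouping used in the question
by_col1 = lambda r: r[0]               # hypothetical coarser grouping

# One row per group: running product and whole-group product coincide.
same = cumprod_per_group(rows, by_col1_col2) == total_prod_per_group(rows, by_col1_col2)
# Several rows per group: the two computations no longer agree.
differs = cumprod_per_group(rows, by_col1) != total_prod_per_group(rows, by_col1)
print(same, differs)
```

If the real data has several rows per group, the Spark window would presumably also need an ordering and a running frame to behave like cumprod.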
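Since a product of positive numbers can be expressed with the log and exp functions (a*b*c = exp(log(a) + log(b) + log(c))), you can compute the product using only Spark built-in functions: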
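from pyspark.sql import functions as F

# log of a non-positive value is null in Spark; coalesce maps an all-null sum to 0
df.groupBy("col1", "col2") \
  .agg(F.max("col3").alias("col3"),
       F.coalesce(F.exp(F.sum(F.log(F.lit(1) - F.col("col3")))), F.lit(0)).alias("col4")) \
  .orderBy("col2") \
  .show()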
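To make the log/exp identity concrete, here is a minimal plain-Python sketch of a running product computed through a running sum of logs. The function name `cumprod_via_logs` and the explicit zero handling are my own assumptions for illustration and are not part of the Spark answer. The logarithm is undefined at 0, so zero factors have to be treated separately, and negative factors are not supported by this trick at all.

```python
import math
from itertools import accumulate
from operator import mul

def cumprod_via_logs(values):
    """Running product of non-negative numbers using exp(sum(log(v)))."""
    out = []
    log_sum = 0.0
    seen_zero = False
    for v in values:
        if v < 0:
            raise ValueError("the log/exp trick needs non-negative factors")
        if v == 0:
            # log(0) is undefined; from here on the product stays 0
            seen_zero = True
        else:
            log_sum += math.log(v)
        out.append(0.0 if seen_zero else math.exp(log_sum))
    return out

factors = [0.5, 0.87, 0.0, 0.4]
direct = list(accumulate(factors, mul))   # ordinary running product
via_logs = cumprod_via_logs(factors)      # same values up to rounding
print(direct)
print(via_logs)
```

The two lists agree only up to floating-point rounding, so compare them with a tolerance rather than with ==.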
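For Spark 2.4+, you can use
Thanks, this fits my requirement perfectly. My data is quite large, with about 1 million records per group. I will run the code and see how it performs.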