Python Spark: using RDD.map inside a withColumn call on a Spark DataFrame

python apache-spark pyspark


I have the following code:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructType, StructField, IntegerType, DoubleType
import math

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.master('local').getOrCreate()


schema = StructType([
    StructField("INDEX", IntegerType(), True),
    StructField("SYMBOL", StringType(), True),
    StructField("DATETIMETS", StringType(), True),
    StructField("PRICE", DoubleType(), True),
    StructField("SIZE", IntegerType(), True),
])

df = spark.createDataFrame(
    data=[(0, 'A', '2002-12-02 9:30:20', 19.75, 30200),
          (1, 'A', '2002-12-02 9:31:20', 19.75, 30200),
          (8, 'A', '2004-12-02 10:36:20', 1.0, 30200),
          (9, 'A', '2006-12-02 22:41:20', 20.0, 30200),
          (10, 'A', '2006-12-02 22:42:20', 40.0, 30200)],
    schema=schema)

Then I do some computation without using Spark. This works fine:

def without_spark(price):
    # Sum sqrt(price) once per element of range(1, 10), i.e. nine times.
    first_summation = sum(map(lambda n: math.sqrt(price), range(1, 10)))
    return first_summation

u_without_spark = udf(without_spark, DoubleType())

df.withColumn("NEW_COL", u_without_spark('PRICE')).show()

However, the following code, which parallelizes the computation with an RDD, does not work:

def with_spark(price):    
    rdd = sc.parallelize(1, 10)
    first_summation = rdd.map(lambda n: math.sqrt(price));
    return first_summation.sum();

u_with_spark = udf(with_spark, DoubleType())

df.withColumn("NEW_COL", u_with_spark('PRICE')).show()

Is what I am trying to do impossible? Is there a faster way?

Thanks for your help.

You cannot call any RDD methods from inside a UDF.

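When you create a UDF, it runs on the workers. RDD and DataFrame operations can only be issued from the driver, so they are not allowed inside a UDF.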
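To make the driver/worker split concrete, here is a minimal sketch, assuming the `df` with a `PRICE` column from the question. The helper names `sum_of_sqrts` and `add_new_col_via_rdd` are hypothetical and not part of the original post. The per-row function is plain Python and never touches the SparkContext, so it can run on the workers; the RDD call (`df.rdd.map`) is issued from the driver, outside any UDF.

```python
import math


def sum_of_sqrts(price, n_terms=9):
    """Plain-Python version of the question's computation.

    Adds sqrt(price) n_terms times (range(1, 10) has nine elements).
    """
    return sum(math.sqrt(price) for _ in range(n_terms))


def add_new_col_via_rdd(df):
    """Apply sum_of_sqrts to every row with a driver-side RDD.map.

    Assumes `df` has a non-null, non-negative PRICE column (an assumption
    of this sketch). Returns a new DataFrame with an extra NEW_COL column.
    """
    # The lambda is shipped to the workers; it only uses plain Python values.
    rows = df.rdd.map(lambda row: tuple(row) + (sum_of_sqrts(row["PRICE"]),))
    # Column types are inferred from the tuples.
    return rows.toDF(df.columns + ["NEW_COL"])
```

With the question's data this could be used as `add_new_col_via_rdd(df).show()`. The important point is where the RDD method is called: from driver code, not from inside the function that runs per row.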

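It looks like what you are after is a UDAF (user-defined aggregate function). This cannot be done from pyspark. You have two options: use collect_list and then apply a UDF to the resulting array, or write the UDAF in Scala and wrap it for pyspark.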
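The first option could look roughly like the sketch below. It assumes the question's `df` with `SYMBOL` and `PRICE` columns; the names `sum_sqrt_of_list`, `aggregate_per_symbol`, `PRICES` and `SUM_SQRT_PRICE` are made up for illustration, and the aggregate itself (sum of square roots per symbol) is only a guess at what one might want to compute.

```python
import math


def sum_sqrt_of_list(prices):
    """Aggregate a list of prices in plain Python; safe to run inside a UDF."""
    return float(sum(math.sqrt(p) for p in prices if p is not None))


def aggregate_per_symbol(df):
    """Group by SYMBOL, gather PRICE into an array, then apply the UDF to it."""
    # Imported lazily so the pure function above is usable without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    sum_sqrt_udf = F.udf(sum_sqrt_of_list, DoubleType())
    return (
        df.groupBy("SYMBOL")
        .agg(F.collect_list("PRICE").alias("PRICES"))
        .withColumn("SUM_SQRT_PRICE", sum_sqrt_udf("PRICES"))
    )
```

Usage would be `aggregate_per_symbol(df).show()`. Note that collect_list puts all values of a group into a single row, so this approach is only reasonable when each group fits comfortably in memory on one worker.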

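"Then I do some computation without using Spark"

When you created the dataframe you used SparkSession, so you are already using Spark. udf and withColumn are Spark DataFrame APIs that are used to transform a dataframe.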

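DataFrames are distributed by nature, i.e. all transformations on DataFrames are executed on the worker nodes. A udf applied through a withColumn transformation therefore runs entirely on the worker nodes. You created the sparkContext (sc) on the driver node, and it cannot be used inside a transformation.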
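"Is what I am trying to do impossible? Is there a faster way?"

Your second implementation is wrong because you are trying to access the sparkContext from inside a transformation.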

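Your first approach appears to work fine and already uses Spark, so I guess you do not need to look for an alternative. If the answer is helpful, please consider accepting it.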
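On the "faster way" part of the question: for this particular computation, one option is to skip the Python UDF altogether and use built-in column functions, which avoids passing every row through a Python worker. The sketch below assumes the question's `df` with a `PRICE` column; `add_new_col_builtin` and `expected_new_col` are hypothetical helper names. It relies on the observation that adding sqrt(price) nine times is the same as 9 * sqrt(price), up to floating-point rounding.

```python
import math


def expected_new_col(price, n_terms=9):
    """Closed form of the question's loop: n_terms * sqrt(price)."""
    return n_terms * math.sqrt(price)


def add_new_col_builtin(df, n_terms=9):
    """Compute NEW_COL with built-in column functions instead of a Python UDF."""
    # Imported lazily so expected_new_col can be used without Spark installed.
    from pyspark.sql import functions as F

    # The expression is evaluated by Spark itself on the workers.
    return df.withColumn("NEW_COL", F.sqrt(F.col("PRICE")) * n_terms)
```

`add_new_col_builtin(df).show()` should give the same NEW_COL values as the UDF-based version for non-negative prices. For logic that cannot be expressed with built-in functions, the plain-Python UDF from the question remains a valid choice.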