Python: Applying a UDF to multiple columns and using NumPy operations


python numpy apache-spark pyspark


I have a DataFrame named result in PySpark, and I want to apply a UDF to it to create a new column, as follows:

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)]).withColumnRenamed("_1","count").withColumnRenamed("_2","df").withColumnRenamed("_3","docs")
@udf("float")
def newFunction(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

result=result.withColumn("new_function_result",newFunction_udf(array("count","df","docs")))

The columns count, df, and docs are all integer columns, but this returns:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

When I pass a single column and compute its square, it works fine.

Any help is much appreciated.

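The error message is misleading, but it is trying to tell you that your function does not return a float. Your function returns a value of type numpy.float64, which you can fetch with the VectorUDT type (function newFunctionVector in the example below). Another way to keep using numpy is to cast the numpy type numpy.float64 to the Python type float (function newFunctionWithArray in the example below).

Last but not least, it is not necessary to call array, because a UDF can take more than one parameter (function newFunction in the example below).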
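Before the full Spark example, the type mismatch can be seen in isolation with NumPy alone. This is only a minimal sketch, not part of the original answer: it assumes one row from the question (count=138, df=5, docs=10), and the variable names raw and converted are made up for illustration.

```python
import numpy as np

# Same formula as in the question, applied to one row (count=138, df=5, docs=10).
raw = (1 + np.log(138)) * np.log(10 / 5)

# NumPy returns its own scalar type rather than the built-in float.
print(type(raw))

# .item() converts a NumPy scalar to the matching built-in Python type,
# which is what a UDF declared with a "float" return type can serialize.
converted = raw.item()
print(type(converted))
```

Calling float(raw) would likely work as well; .item() is used here because it is what the answer's functions use.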

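import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])

# Returns numpy.float64 unchanged; wrapped with VectorUDT further down.
def newFunctionVector(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

# Takes one array column; .item() converts numpy.float64 to a Python float.
@udf("float")
def newFunctionWithArray(arr):
    returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
    return returnValue.item()

# Takes the three columns as separate arguments, so array() is not needed.
@udf("float")
def newFunction(count, df, docs):
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()

vector_udf = udf(newFunctionVector, VectorUDT())

result = result.withColumn("new_function_result", newFunction("count","df","docs"))
result = result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count","df","docs")))
# Note: as posted, this column also calls newFunctionWithArray, not vector_udf.
result = result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count","df","docs")))

result.printSchema()
result.show()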

Output:

root 
|-- count: long (nullable = true) 
|-- df: long (nullable = true) 
|-- docs: long (nullable = true) 
|-- new_function_result: float (nullable = true) 
|-- new_function_result_WithArray: float (nullable = true) 
|-- new_function_result_Vector: float (nullable = true)

+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|  138|  5|  10|           4.108459|                     4.108459|                  4.108459| 
|  128|  4|  10|           5.362161|                     5.362161|                  5.362161|
|  112|  3|  10|          6.8849173|                    6.8849173|                 6.8849173|
|  120|  3|  10|           6.967983|                     6.967983|                  6.967983|
|  189|  1|  10|          14.372153|                    14.372153|                 14.372153|  
+-----+---+----+-------------------+-----------------------------+--------------------------+
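As a sanity check on the table above, the same formula can be evaluated without Spark. This is a rough sketch using only the standard library. The helper name score is hypothetical and does not appear in the original answer. The results are double precision, so they should agree with the float columns above only to about six significant digits.

```python
import math

# The five (count, df, docs) rows from the question.
rows = [(138, 5, 10), (128, 4, 10), (112, 3, 10), (120, 3, 10), (189, 1, 10)]

def score(count, df, docs):
    # (1 + ln(count)) * ln(docs / df), the formula used by the UDFs.
    return (1 + math.log(count)) * math.log(docs / df)

values = [score(*row) for row in rows]
for row, value in zip(rows, values):
    print(row, round(value, 4))
```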

Please give us more details and show us the full error message.

@cronoik Edited

Sorry, your createDataFrame call throws an error. Shouldn't it be

sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)])

? Sorry for the inconvenience.