Python: How to convert a DataFrame column from string to float/double in PySpark 1.6?


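In PySpark 1.6 DataFrames, there is currently no built-in Spark function to convert a string to float/double.

Suppose we have an RDD of ('house_name', 'price') pairs, where both values are strings, and we want to convert the price from string to float. With an RDD, we can apply map and the Python float function to do this: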

New_RDD = RawDataRDD.map(lambda row: (row[0], float(row[1])))    # this works

With a PySpark 1.6 DataFrame, the same approach does not work:

New_DF = rawdataDF.select('house name', float('price')) # did not work

Until a built-in PySpark function is available, how can this conversion be done with a UDF? I developed the following conversion UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def string_to_float(x):
    return float(x)

udfstring_to_float = udf(string_to_float, StringType())
rawdata.withColumn("house name", udfstring_to_float("price"))
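
One thing to watch in a UDF like this is the declared return type: with StringType() the resulting column is still a string. Below is a minimal sketch, assuming the goal is a double column and that values that cannot be parsed should become null. The helper names to_float_or_none and with_double_column are hypothetical and not part of PySpark.

```python
def to_float_or_none(value):
    """Parse a string as a float; return None (null in Spark) if it cannot be parsed."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def with_double_column(df, column="price"):
    """Return df with `column` replaced by its parsed values (hypothetical helper)."""
    # Imported lazily so the parsing logic above can be exercised without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Declare a numeric return type so the new column is not a string.
    parse_udf = udf(to_float_or_none, DoubleType())
    # Reusing the column name replaces the existing column.
    return df.withColumn(column, parse_udf(df[column]))
```

Keeping the parsing in a plain Python function makes it easy to check on its own before wrapping it in a UDF.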

Is there a better and simpler way to accomplish the same thing?

According to the documentation, you can use the cast function on a column as follows:

from pyspark.sql.types import DoubleType
rawdata.withColumn("house name", rawdata["price"].cast(DoubleType()).alias("price"))

The answer should be as follows:

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)

>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)

because it is the shortest one-line solution that does not use any user-defined function. You can use the printSchema() function to check that it worked.
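
The cast approach can also be wrapped in a small helper. This is only a sketch, assuming a DataFrame with a string column. The name cast_column is hypothetical, and the target type is passed as a string such as "double", which Column.cast accepts in place of a DataType instance.

```python
def cast_column(df, column, type_name="double"):
    """Return a new DataFrame with `column` cast to `type_name` (hypothetical helper)."""
    # Reusing the same column name replaces the column instead of adding a new one.
    return df.withColumn(column, df[column].cast(type_name))
```

For example, rawdata = cast_column(rawdata, "price") followed by rawdata.printSchema() should then report price as a double column. No import from pyspark.sql.types is needed when the type is given by name.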

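This didn't work for me, @Jaco. The OP said he is using PySpark 1.6, and the documentation you linked to is for 1.3. When I tried this on 1.6, I got

AttributeError: 'DoubleType' object has no attribute 'alias'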

Did you import DoubleType from pyspark.sql.types? I'm sure I tested this on PySpark 1.6 before posting. FIX: it should be rawdata.withColumn("house name", rawdata["price"].cast(DoubleType()).alias("price"))