Apache Spark: Spark with Python, can't cast a string column to decimal/double


apache-spark pyspark


Of all the questions posted about this operation, I couldn't find anything that works.

I tried several versions, and in all of them I have this DataFrame:

dataFrame = spark.read.format("com.mongodb.spark.sql").load()

The output of dataFrame.printSchema() is:

root
 |-- SensorId: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- device: string (nullable = true)
 |-- deviceType: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- gen_val: string (nullable = true)
 |-- lane_id: string (nullable = true)
 |-- system_id: string (nullable = true)
 |-- time: string (nullable = true)

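After creating the DataFrame, I want to cast the column 'gen_val' (whose name is stored in the variable results.inputColumns) from String type to Double type. Different versions led to different errors.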

Version #1

Code:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame[results.inputColumns].cast('double'))

Using cast(DoubleType()) instead produces the same error.

Error:

AttributeError: 'DataFrame' object has no attribute 'cast'

Version #2

Code:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame['gen_val'].cast('double'))

This option isn't really relevant anyway, because the column name can't be hard-coded.

Error:

    dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame['gen_val'].cast('double'))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1502, in withColumn
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o31.withColumn. Trace:
py4j.Py4JException: Method withColumn([class java.util.ArrayList, class org.apache.spark.sql.Column]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

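Any help is appreciated.

I tried something else and it works: instead of changing the input column's data, I created a separate cast column. I think this is less efficient, but it's what I have for now.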

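from pyspark.ml.feature import VectorAssembler

dataFrame = spark.read.format("com.mongodb.spark.sql").load()
# Build the cast column from gen_val (a string column)
col = dataFrame.gen_val.cast('double')
# Attach it under a new name; the second cast is redundant but harmless
dataFrame = dataFrame.withColumn('double', col.cast('double'))
assembler = VectorAssembler(inputCols=["double"], outputCol="features")
output = assembler.transform(dataFrame)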

For Zhang Tong: this is the output of dataFrame.printSchema():

root
 |-- SensorId: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- device: string (nullable = true)
 |-- deviceType: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- gen_val: string (nullable = true)
 |-- lane_id: string (nullable = true)
 |-- system_id: string (nullable = true)
 |-- time: string (nullable = true)

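Anyway, this is a very basic transformation, and in the near future I will need to do more complex ones. If any of you know of good examples, explanations, or documentation for DataFrame transformations with Spark and Python, I would be grateful.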

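It's not quite clear what you are trying to do. The first argument of withColumn should be a DataFrame column name, either an existing one (to be modified) or a new one (to be created), whereas (at least in your version #1) you use it as if results.inputColumns were already a column (which it is not).

In any case, casting a string column to double is straightforward; here is a toy example: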

spark.version
# u'2.2.0'
from pyspark.sql.types import DoubleType
df = spark.createDataFrame([("foo", "1"), ("bar", "2")], schema=["A", "B"])
df
# DataFrame[A: string, B: string]
df.show()
# +---+---+
# |  A|  B|
# +---+---+
# |foo|  1|
# |bar|  2|
# +---+---+
df2 = df.withColumn('B', df['B'].cast('double'))
df2.show()
# +---+---+
# |  A|  B|
# +---+---+
# |foo|1.0|
# |bar|2.0|
# +---+---+
df2
# DataFrame[A: string, B: double]
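To confirm a cast programmatically rather than by reading the show() output, one option is to inspect the DataFrame's dtypes attribute, which PySpark exposes as a list of (column name, type string) pairs. The helper below is a hypothetical sketch (columns_not_of_type is not part of PySpark); it works on that plain list, so it runs without a Spark session:

```python
def columns_not_of_type(dtypes, names, expected="double"):
    """Return the names whose type in `dtypes` differs from `expected`.

    `dtypes` is a list of (column_name, type_string) pairs, the shape
    of a PySpark DataFrame's `.dtypes` attribute.
    """
    type_by_name = dict(dtypes)
    return [n for n in names if type_by_name.get(n) != expected]


# dtypes of the two toy DataFrames, written out by hand for illustration
before = [("A", "string"), ("B", "string")]
after = [("A", "string"), ("B", "double")]

print(columns_not_of_type(before, ["B"]))  # ['B'] (still a string)
print(columns_not_of_type(after, ["B"]))   # []    (cast took effect)
```

With a live session, the call would be columns_not_of_type(df2.dtypes, ["B"]).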

In your case, this should do the job:

from pyspark.sql.types import DoubleType
new_df = dataFrame.withColumn('gen_val', dataFrame['gen_val'].cast('double'))
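The java.util.ArrayList in the Py4J error suggests that results.inputColumns holds a list rather than a single string. If so, a small wrapper can accept either form. This is a minimal sketch, assuming Python 3 and that every entry names an existing column; as_column_names and cast_to_double are hypothetical helper names, not PySpark functions:

```python
def as_column_names(columns):
    """Return a list of names from either a single name or a list/tuple of names."""
    if isinstance(columns, str):
        return [columns]
    return list(columns)


def cast_to_double(df, columns):
    """Cast each named column of a PySpark DataFrame to double, keeping its name."""
    for name in as_column_names(columns):
        # Indexing with a str yields a Column (which has .cast), and
        # withColumn likewise expects a str as its first argument.
        df = df.withColumn(name, df[name].cast("double"))
    return df
```

Usage would then look like dataFrame = cast_to_double(dataFrame, results.inputColumns), whether the variable holds 'gen_val' or ['gen_val'].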

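Please show dataFrame.show() or dataFrame.printSchema(), and what results.inputColumns is.

All the errors here are caused by 'inputColumns' being a list. In case #1, if you pass a list, as in dataFrame[list], it returns a new DataFrame object containing the columns you specified. A DataFrame has no 'cast' function, hence the error. If you pass a string, as in dataFrame[str], it returns a Column object, which does have a cast function. In case #2, you have gotten past the first problem, but now the Py4J exception says there is no withColumn function that takes a list as its first argument. It must be a string specifying the new column name.

There was indeed a mistake in my question regarding the variable that holds the column name. Ignoring that, I still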