Apache Spark: Converting an RDD to a DataFrame in PySpark


I am unable to convert RDD data to a DataFrame in PySpark.

Here is the code I wrote:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, Row
from pyspark.sql import *
spark = SparkSession \
        .builder \
        .appName("pyspark") \
        .master("local[3]") \
        .getOrCreate()  
empdata = spark.sparkContext.textFile("/FileStore/tables/empdatarevised.txt").map(lambda x: x.split(","))        
schema = StructType([
        StructField("eid",IntegerType(),True),
        StructField("ename",StringType(),True),
        StructField("edept",StringType(),True),
        StructField("esal", IntegerType(), True),
        StructField("revsal", DoubleType(), True),
        ])
df = spark.createDataFrame(data=empdata,schema=schema)
df.show()

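I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 7) (ip-10-172-239-64.us-west-2.compute.internal executor driver): org.apache.spark.api.python.PythonException: 'TypeError: field eid: IntegerType can not accept object '100' in type <class 'str'>'. Full traceback below: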

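I know this can be done with

spark.read.format("csv").load("file.txt")

but my goal is to convert the RDD into a DataFrame with a StructType schema.

Any help would be appreciated.

Thanks in advance.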

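When you create a DataFrame from an RDD, Spark does not cast strings to integers or doubles for you. Every value produced by split(",") is a str, so the IntegerType and DoubleType fields in the schema reject it. You can convert the entries in the RDD explicitly, for example: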

empdata = (spark.sparkContext.textFile("/FileStore/tables/empdatarevised.txt")
             .map(lambda x: x.split(","))
             # cast each field to the Python type the schema expects
             .map(lambda x: [int(x[0]), x[1], x[2], int(x[3]), float(x[4])])
          )

df = spark.createDataFrame(data=empdata,schema=schema)
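
The same idea can be written with a named parsing function, which is easier to test than an inline lambda. This is only a minimal sketch: the helper names parse_employee and build_dataframe are hypothetical, and it assumes every line of the file holds exactly five comma-separated fields in the order eid, ename, edept, esal, revsal (the question does not show the file contents).

```python
def parse_employee(line):
    """Split one CSV line and cast each field to the type the schema expects.

    Assumes exactly five comma-separated fields: eid, ename, edept, esal, revsal.
    """
    eid, ename, edept, esal, revsal = line.split(",")
    return (int(eid), ename, edept, int(esal), float(revsal))


def build_dataframe(spark, lines):
    """Build a typed DataFrame from an iterable of raw CSV lines (hypothetical helper)."""
    # Imported here so parse_employee can be used and tested without Spark installed.
    from pyspark.sql.types import (
        StructType, StructField, StringType, IntegerType, DoubleType,
    )

    schema = StructType([
        StructField("eid", IntegerType(), True),
        StructField("ename", StringType(), True),
        StructField("edept", StringType(), True),
        StructField("esal", IntegerType(), True),
        StructField("revsal", DoubleType(), True),
    ])
    # Each RDD element is now a tuple of int/str/str/int/float, matching the schema.
    rdd = spark.sparkContext.parallelize(lines).map(parse_employee)
    return spark.createDataFrame(rdd, schema)
```

With the spark session from the question, build_dataframe(spark, ["100,Alice,HR,5000,5500.0"]).show() should display one typed row; the sample line is made up for illustration. To read the real file instead, replace the parallelize call with spark.sparkContext.textFile(path).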