PySpark: Applying a schema to a DataFrame

I have several DataFrames in which every column is StringType.

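I want to apply a custom schema to change the types, and everything appears to work (no error message). But after all the transformations I cannot call, for example, count() on the new DataFrame. Here is what I do: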

my_df_1 = spark.createDataFrame(
    [
        ("1", 'foo'),
        ("2", 'bar'),
    ],
    ['id', 'txt']
)

my_df_2 = spark.createDataFrame(
    [
        ("1", 'foo'),
        ("2", 'bar'),
    ],
    ['id', 'txt']
)

#union the two DataFrames
df_union = my_df_1.unionAll(my_df_2)

#Here count still works
df_union.count()

#Create a schema
from pyspark.sql import types as T
schema = T.StructType([T.StructField('id', T.IntegerType(), True),
                     T.StructField('txt', T.StringType(), True)])

#Convert the DataFrame to RDD and apply the schema to the DataFrame
schema_df = spark.createDataFrame(df_union.rdd, schema=schema)

#And this throws an error
schema_df.count()

#Error: TypeError: field id: IntegerType can not accept object '2' in type <class 'str'>

Where am I going wrong?

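When the DataFrame was created, the id column was a string type. You cannot turn a string into an integer this way: createDataFrame with an explicit schema only checks each value against the declared type, it does not convert it, so the Python str '2' is rejected by IntegerType. Because Spark evaluates lazily, that check only runs once an action such as count() processes the rows, which is why no error appears earlier.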

If you need to change a column's type, use cast('int') at the column level. Just cast the id column as follows:

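from pyspark.sql import functions as F, types as T
# Cast id from string to integer; the DataFrame's schema now reports IntegerType
df_union = df_union.withColumn("id", F.col("id").cast(T.IntegerType()))
schema_df = spark.createDataFrame(df_union.rdd, schema=df_union.schema)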

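Apparently I can, because I get no error and the DataFrame shows the correct types in the "console". "Under the hood" that does not seem to be the case. I know I can use cast(), but casting 200 columns is something I am trying to avoid. There must be a better solution. But thank you for your comment and your time.

I "can't" cast, because the original DataFrame has 200 columns, and casting each one would be a nightmare. If I cast as you suggest in your example, then I no longer need "schema_df", because "df_union" already has the correct types. Thanks for your answer!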
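The worry about casting 200 columns by hand can probably be avoided by generating the casts from a list of target types instead of writing them one by one. The following is a minimal sketch, not part of the original answer: the helper names build_cast_exprs and apply_types are hypothetical, and it assumes the target types are given as (column name, Spark SQL type string) pairs such as ("id", "int"). It relies only on DataFrame.selectExpr, which accepts SQL expression strings.

```python
def build_cast_exprs(target_types):
    """Return one SQL CAST expression per (column name, Spark SQL type) pair.

    `target_types` is assumed to be an iterable of pairs such as
    [("id", "int"), ("txt", "string")]. Hypothetical helper for illustration.
    """
    exprs = []
    for name, sql_type in target_types:
        # Backtick-quote the column name so unusual characters are accepted;
        # a literal backtick inside a name is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        exprs.append(f"CAST({quoted} AS {sql_type}) AS {quoted}")
    return exprs


def apply_types(df, target_types):
    """Cast every listed column of a Spark DataFrame in a single projection.

    Columns that are not listed in `target_types` are dropped from the result.
    """
    return df.selectExpr(*build_cast_exprs(target_types))
```

With the objects from the question, apply_types(df_union, [(f.name, f.dataType.simpleString()) for f in schema.fields]) would derive the pairs from the existing StructType, so the list of 200 columns never has to be typed out. Unlike createDataFrame with a schema, a cast converts the values; depending on Spark's ANSI setting, strings that cannot be parsed as the target type either become null or raise an error.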