Spark DataFrame overwrites Hive table data but does not delete old data

apache-spark amazon-s3 pyspark hive

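I am trying to create the table path before Spark overwrites the table, so that the write does not fail with a "table path not found" error. However, when the table path is created with the S3 put_object call, Spark does not delete the old data. Instead, the write behaves like append rather than overwrite.

To reproduce: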

# Hive external table whose data is stored on S3
# test_table path: s3a://test_bucket/test_table/

 
df = spark_session.sql("SELECT * FROM test_table")

df.count()  # returns 1000

##### S3 operation: create the table path as an empty object #####

import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="test_bucket", Body="", Key="test_table/"
)

##### End of S3 operation #####

df.write.insertInto("test_table", overwrite=True)

# The same happens with df.write.save(mode="overwrite", format="parquet", path="s3a://test_bucket/test_table")

df = spark_session.sql("SELECT * FROM test_table")

df.count()  # returns 2000

Without the S3 operation, the old data is deleted on every overwrite.
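
To see what the put_object call leaves behind, it may help to list the objects under the table prefix before and after the write. The sketch below is not a fix and does not claim to identify the cause. It only flags zero-byte objects whose key ends in "/", which is what a put_object call with an empty Body and Key="test_table/" creates. The helper names find_directory_markers and list_objects are hypothetical, and the sample listing is made-up data for illustration.

```python
from typing import Dict, Iterable, List


def find_directory_markers(objects: Iterable[Dict]) -> List[str]:
    """Return keys of zero-byte objects whose key ends with '/'.

    Entries without a "Size" field are treated as zero-byte.
    """
    return [
        obj["Key"]
        for obj in objects
        if obj["Key"].endswith("/") and obj.get("Size", 0) == 0
    ]


def list_objects(bucket: str, prefix: str) -> List[Dict]:
    """List every object under a prefix (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helper above runs without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    objects: List[Dict] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects
        objects.extend(page.get("Contents", []))
    return objects


# Hypothetical listing, shaped like the "Contents" entries of list_objects_v2
sample_listing = [
    {"Key": "test_table/", "Size": 0},
    {"Key": "test_table/part-00000.snappy.parquet", "Size": 2048},
]
markers = find_directory_markers(sample_listing)
print(markers)
```

Against the real bucket, the call would be find_directory_markers(list_objects("test_bucket", "test_table/")). Comparing the full listing before and after the overwrite shows whether the old data files are still present next to the new ones.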