Apache Spark: DataFrame overwrite writes to a Hive table but does not delete the old data

Tags: apache-spark, amazon-s3, pyspark, hive, apache-spark-sql

I tried to create the table path before Spark overwrites the table, to avoid a "table path not found" error. However, after creating the table path with an S3 put_object call, Spark no longer deletes the old data. The write behaves like append instead of overwrite.

To reproduce:

# Hive external table whose data lives on S3
# table path: s3a://test_bucket/test_table/

import boto3

df = spark_session.sql("SELECT * FROM test_table")
df.count()  # returns 1000 rows

##### S3 operation #####
# Pre-create the table path by writing an empty "directory marker" object
s3 = boto3.client("s3")
s3.put_object(Bucket="test_bucket", Body="", Key="test_table/")
##### S3 operation #####

# Overwrite the table; with the marker object present this behaves like append
df.write.insertInto("test_table", overwrite=True)
# Same result with:
# df.write.save(mode="overwrite", format="parquet", path="s3a://test_bucket/test_table")

df = spark_session.sql("SELECT * FROM test_table")
df.count()  # returns 2000 rows -- the old data was not deleted
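
To see what actually happened on S3, one can list the objects under the table prefix. The following is a minimal inspection sketch using boto3 (the bucket and prefix names follow the example above); the expectation that both generations of parquet files appear is an assumption consistent with the doubled row count:

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="test_bucket", Prefix="test_table/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
# Expected to show the zero-byte "test_table/" marker plus two sets of
# part-*.parquet files if the overwrite appended instead of replacing.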

Without the S3 operation, the old data is deleted every time.
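
One workaround worth trying (my own sketch, not from the original post) is to pre-create the path through Hadoop's FileSystem API instead of a raw put_object, so the s3a connector manages its own directory markers; whether this avoids the append behaviour is an assumption to verify:

# Sketch (untested assumption): create the table path via Hadoop's
# FileSystem API rather than boto3, letting the s3a connector create
# and clean up directory markers itself.
jvm = spark_session._jvm
hadoop_conf = spark_session._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path("s3a://test_bucket/test_table/")
fs = path.getFileSystem(hadoop_conf)
if not fs.exists(path):
    fs.mkdirs(path)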