Python: How to delete a particular month from a Parquet file partitioned by month


python apache-spark pyspark


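I have monthly revenue data for the last 5 years, and I store the DataFrame for each month in parquet format in append mode, partitioned by the month column. Here is the pseudo-code: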
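def Revenue(filename):
    # load() defaults to Parquet, so the CSV format is stated explicitly
    df = spark.read.load(filename, format='csv')
    # ... (further processing elided in the original) ...
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')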

Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')

The df is stored in parquet format month by month, with one folder per value of month.

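Question: How can I delete the parquet folder corresponding to a particular month?
One way would be to load all these parquet files into one big df, use a .where() clause to filter out that particular month, and then save the result back in parquet format with partitionBy month in overwrite mode, like this: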
# If we want to remove data from Feb, 2015
from pyspark.sql.functions import col, lit

df = spark.read.format('parquet').load('Revenue.parquet')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')

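But this approach is quite cumbersome, because it rewrites every month just to drop one.
Another way is to delete the folder of that particular month directly, but I am not sure whether that is the right way to go about it, since it might alter the metadata in some unforeseeable way.
What would be the right way to delete the parquet data for a particular month?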
Spark supports dropping a partition, including both its data and its metadata.

Quoting the Scala code comment:
/**
 * Drop Partition in ALTER TABLE: to drop a particular partition for a table.
 *
 * This removes the data and metadata for this partition.
 * The data is actually moved to the .Trash/Current directory if Trash is configured,
 * unless 'purge' is true, but the metadata is completely lost.
 * An error message will be issued if the partition does not exist, unless 'ifExists' is true.
 * Note: purge is always false when the target is a view.
 *
 * The syntax of this command is:
 * {{{
 *   ALTER TABLE table DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] [PURGE];
 * }}}
 */
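
The syntax quoted above can also be assembled programmatically before it is handed to whichever SQL engine runs it. The following is only a minimal sketch: build_drop_partition_sql and its parameters are hypothetical names introduced here for illustration, not part of Spark or Hive, and partition values are assumed to be plain strings without quotes.

```python
def build_drop_partition_sql(table, partition_spec, if_exists=True, purge=False):
    """Assemble an ALTER TABLE ... DROP PARTITION statement (hypothetical helper)."""
    if not partition_spec:
        raise ValueError("partition_spec must name at least one partition column")
    for value in partition_spec.values():
        # Assumption: values never contain quotes, so no escaping is attempted
        if "'" in str(value):
            raise ValueError("quotes are not supported in partition values")
    # Render each column/value pair as col='value'
    pairs = ", ".join(
        "{}='{}'".format(col, val) for col, val in partition_spec.items()
    )
    parts = ["ALTER TABLE", table, "DROP"]
    if if_exists:
        parts.append("IF EXISTS")
    parts.append("PARTITION ({})".format(pairs))
    if purge:
        parts.append("PURGE")
    return " ".join(parts)


sql = build_drop_partition_sql("db.yourtable", {"month": "2015-02-01"})
print(sql)
```

The resulting string still needs a real, partitioned table to act on; it does nothing for data that only exists as files under a path.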

In your case, there is no backing table.
We could register the dataframe as a temp table and use the syntax above.
From pyspark, the SQL can be run with spark.sql().
The statement below deletes only the metadata related to the partition information:
ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");

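If you want to delete the data as well, you need to set the EXTERNAL table property of your Hive external table to FALSE. This turns your Hive table into a managed table: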
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='FALSE');

Afterwards you can set it back to an external table:
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='TRUE');
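
The three statements above are meant to run in a fixed order: make the table managed, drop the partition, then make it external again. A rough sketch of how they might be bundled into a single Hive CLI call is shown here; build_hive_drop_command is a hypothetical helper name, and the table, column, and value are simply the ones used in this answer.

```python
def build_hive_drop_command(table, column, value):
    """Return an argument list for `hive -e` (hypothetical helper, not a Hive API)."""
    statements = [
        # 1. Temporarily turn the external table into a managed one
        "ALTER TABLE {} SET TBLPROPERTIES('EXTERNAL'='FALSE')".format(table),
        # 2. Drop the partition; for a managed table this removes the data too
        "ALTER TABLE {} DROP IF EXISTS PARTITION({}='{}')".format(table, column, value),
        # 3. Switch the table back to external
        "ALTER TABLE {} SET TBLPROPERTIES('EXTERNAL'='TRUE')".format(table),
    ]
    return ["hive", "-e", "; ".join(statements) + ";"]


cmd = build_hive_drop_command("db.yourtable", "loaded_date", "2019-08-22")
print(cmd[2])
```

On a machine with the Hive CLI installed, the list could be passed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids nested quoting. Note that if the second statement fails, the table is left as a managed table until the third statement is run by hand.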

I tried to set this property through the Spark session, but ran into an issue:
 spark.sql("""alter table db.test_external set tblproperties ("EXTERNAL"="TRUE")""")
pyspark.sql.utils.AnalysisException: u"Cannot set or change the preserved property key: 'EXTERNAL';"

I am sure there must be some way to do this from Spark. I ended up using Python instead: I defined the function below in pyspark and it did the job.
import subprocess
import sys

query=""" hive -e 'alter table db.yourtable set tblproperties ("EXTERNAL"="FALSE");ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");' """

def delete_partition():
    # Run the Hive CLI through the shell and capture stderr for error reporting
    p = subprocess.Popen(query, shell=True, stderr=subprocess.PIPE)
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        # stderr is bytes in Python 3, so decode it before printing
        print(stderr.decode())
        sys.exit(1)

>>> delete_partition()

This deletes both the metadata and the data.
Note: I have tested this with a Hive ORC external partitioned table, which is partitioned on loaded_date.
# Partition Information
# col_name              data_type               comment

loaded_date             string

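Update:
Basically, your data lies at the HDFS location in subdirectories named
/Revenue/month=2015-02-01
/Revenue/month=2015-03-01
/Revenue/month=2015-04-01

and so on. A partition can therefore be removed by deleting its directory: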
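import subprocess

def delete_partition(month_delete):
    # Build the partition directory, e.g. /some_hdfs_location/Revenue/month=2015-02-01
    hdfs_path = "/some_hdfs_location/Revenue/month="
    final_path = hdfs_path + month_delete
    # Recursively remove the partition directory from HDFS
    rc = subprocess.call(["hadoop", "fs", "-rm", "-r", final_path])
    # Report success only when the hadoop command exited cleanly
    if rc == 0:
        print("got deleted")

delete_partition("2015-02-01")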
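
A slightly more defensive variant might first check that the partition directory exists, using hadoop fs -test -d, and only then remove it. This is a sketch under assumptions: partition_path, delete_partition_if_present, and the run parameter are hypothetical names introduced here, and the stand-in runner at the bottom only records commands so that the logic can be tried without a Hadoop client.

```python
import subprocess


def partition_path(base_path, column, value):
    """Build the directory of one partition, e.g. <base>/month=2015-02-01."""
    return "{}/{}={}".format(base_path.rstrip("/"), column, value)


def delete_partition_if_present(base_path, column, value, run=subprocess.call):
    """Remove a partition directory from HDFS only when it exists.

    `run` is injectable so the logic can be exercised without a Hadoop client.
    Returns True when a directory was removed, False when there was nothing to do.
    """
    path = partition_path(base_path, column, value)
    # `hadoop fs -test -d` exits with 0 when the path is an existing directory
    if run(["hadoop", "fs", "-test", "-d", path]) != 0:
        return False
    # Recursively delete the partition directory
    if run(["hadoop", "fs", "-rm", "-r", path]) != 0:
        raise RuntimeError("failed to remove " + path)
    return True


# Dry run with a stand-in runner that records commands instead of calling hadoop
recorded = []


def fake_run(cmd):
    recorded.append(cmd)
    return 0


removed = delete_partition_if_present(
    "/some_hdfs_location/Revenue", "month", "2015-02-01", run=fake_run
)
print(removed, recorded[-1])
```

Because the data in the question is read by path rather than through a metastore table, Spark discovers the partitions from the directory layout the next time the base path is loaded, so there is no separate table metadata to update in that setup.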

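Here is a link to a good discussion in case you opt for that approach later, although it is not an answer to your original question. Posting it just for reference.

@vikrantrana Thanks a lot, Vikrant, for pointing me to the link. Let me try to understand it.

Please see the answer below. It can serve as a pointer for your original question. You will have to make some changes for the parquet format or for your partition column. Also let me know if you want to implement it with Spark functions.

This question seems to have been under discussion for a long time. Not quite sure, but it may be.

Do you need to do the first part, df = ...? I tend to do this in a script.

If that works, it would be interesting. We can write SQL directly, as mentioned in the link. I am not sure how ALTER TABLE fits the syntax mentioned in the link.

@DaRkMaN Hi! Following your answer, I found that ALTER TABLE with the PURGE option deletes the data on HDFS only when the table is internal, not when it is external. Mine is an external one. How can I delete the corresponding data from HDFS?

As you mentioned in the comments, this ALTER TABLE… code does not fit the PySpark framework. Maybe it runs on Hive, but I have to run it in PySpark. Thanks for your efforts.

OK, I will try it on Monday and let you know.

Well, I tried to look into it, but since I am not coding in Hive but directly in PySpark on Jupyter, I do not know what it would be in my case.

You are saving the DataFrame directly to some HDFS location. Let me check.

Thanks. Yes, you are absolutely right. I have saved my df on HDFS in parquet format, partitioned by month, as shown in the question, and I load my df directly from there.

Vikrant :)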