PySpark with Hive: can't correctly create a partitioned table and save into it from a DataFrame

apache-spark hive pyspark

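I'm trying to convert JSON files to Parquet with very few transformations (adding a date), but I need to partition the data before saving it as Parquet.

I've hit a wall here.

This is how the table is created: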


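    import pyspark.sql.functions as fn

    # spark, data_location, cond3, today and warehouse_location are defined elsewhere
    df_temp = spark.read.json(data_location) \
        .filter(
            cond3
        )
    # Stamp every row with today's date
    df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
    df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))

    # LIKE copies the temp view's columns but defines no partition columns
    spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
    spark.sql("DESC {}".format("duration"))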

Then, to save the converted data:

    df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')

But this produces the following error:

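pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n;'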

The schema is:

    root
     |-- action_id: string (nullable = true)
     |-- customer_id: string (nullable = true)
     |-- duration: long (nullable = true)
     |-- initial_value: string (nullable = true)
     |-- item_class: string (nullable = true)
     |-- set_value: string (nullable = true)
     |-- start_time: string (nullable = true)
     |-- stop_time: string (nullable = true)
     |-- undo_event: string (nullable = true)
     |-- year: integer (nullable = true)
     |-- month: integer (nullable = true)
     |-- day: integer (nullable = true)
     |-- date: date (nullable = true)

So I tried changing the CREATE TABLE statement to:

    spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))

But this produces an error like:

…mismatched input 'PARTITIONED' expecting …

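So I discovered that PARTITIONED BY doesn't work together with LIKE, but I'm out of ideas.

If I use USING instead of LIKE, I get this error:

pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'

How should I add partitions when creating the table?

P.S. Once the table's schema is defined with its partitions, I simply want to write to it.
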
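I finally figured out how to do it with Spark:

    df_temp = spark.read.json(data_location)  # then apply the filter and add "date" as in the question

    # Expose the DataFrame to SQL so the CTAS below can select from it
    df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))

    # CREATE TABLE ... AS SELECT takes its schema from the query, so PARTITIONED BY is accepted
    spark.sql("""
    CREATE TABLE IF NOT EXISTS {1}
    USING PARQUET
    PARTITIONED BY (customer_id, date)
    LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
    """.format("duration_small","duration", warehouse_location))

    spark.sql("DESC {}".format("duration"))

    # saveAsTable in append mode matches columns by name. On the first run the CTAS
    # above has already written these rows, so this append stores them a second time.
    df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')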
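
As an alternative to the CTAS above, the second error in the question suggests another route: give the table an explicit schema so that `PARTITIONED BY` has columns to refer to. The sketch below builds such a statement from `(name, type)` pairs, the shape that `DataFrame.dtypes` returns. The helper `build_create_table_ddl` and the example location are hypothetical names used for illustration, not part of the original post, and the generated SQL has not been run against the poster's setup.

```python
def build_create_table_ddl(table, dtypes, partition_cols, location):
    """Build a CREATE TABLE statement with an explicit schema and partition columns.

    dtypes is a list of (column_name, spark_sql_type) pairs, the same shape
    that DataFrame.dtypes returns. This helper is a hypothetical illustration.
    """
    names = [name for name, _ in dtypes]
    missing = [c for c in partition_cols if c not in names]
    if missing:
        raise ValueError("partition columns not in schema: {}".format(missing))

    # Backticks guard column names that might collide with SQL keywords.
    columns = ", ".join("`{}` {}".format(name, dtype) for name, dtype in dtypes)
    partitions = ", ".join("`{}`".format(c) for c in partition_cols)
    return (
        "CREATE TABLE IF NOT EXISTS {} ({}) "
        "USING PARQUET PARTITIONED BY ({}) "
        "LOCATION '{}'"
    ).format(table, columns, partitions, location)


# A shortened version of the schema from the question; the path is made up.
example_dtypes = [
    ("action_id", "string"),
    ("customer_id", "string"),
    ("duration", "bigint"),
    ("date", "date"),
]
ddl = build_create_table_ddl(
    "duration", example_dtypes, ["customer_id", "date"], "/tmp/warehouse/duration"
)
print(ddl)
```

With a live session, something like `spark.sql(build_create_table_ddl("duration", df_temp.dtypes, ["customer_id", "date"], "{}/duration".format(warehouse_location)))` would create the empty partitioned table, after which the `saveAsTable` append from the question should find matching partition columns.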

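I don't know why, but I can't use insertInto: it uses a strange customer_id and doesn't append the different dates.

Is the duration table already defined? Then it has no partitioning, yet you are trying to append data with partitioning.

Well, it is defined by the CREATE TABLE; I'm trying to find out how to create it with partitioning.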
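
On the `insertInto` symptom: `DataFrameWriter.insertInto` resolves columns by position and ignores their names, whereas `saveAsTable` in append mode matches them by name. A partitioned data source table typically lists its partition columns last, so a DataFrame that still has `customer_id` in second position would likely put other values into that column. The following is a minimal sketch of the reordering, assuming that this is the cause; `reorder_for_insert_into` is a hypothetical helper, and the column list is a shortened version of the schema in the question.

```python
def reorder_for_insert_into(df_columns, partition_cols):
    """Return df_columns with the partition columns moved to the end.

    Hypothetical helper: it only rearranges names, it does not touch Spark.
    """
    missing = [c for c in partition_cols if c not in df_columns]
    if missing:
        raise ValueError("partition columns not in DataFrame: {}".format(missing))
    data_cols = [c for c in df_columns if c not in partition_cols]
    return data_cols + list(partition_cols)


columns = ["action_id", "customer_id", "duration", "year", "date"]
ordered = reorder_for_insert_into(columns, ["customer_id", "date"])
print(ordered)

# With a live session (not run here), either of these would line the
# DataFrame up with the table before a positional insert:
#   df_final.select(*ordered).write.insertInto("duration")
#   df_final.select(*spark.table("duration").columns).write.insertInto("duration")
```

Selecting by the table's own column list, as in the last comment, avoids guessing the order at all.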