Apache Spark: Hive incremental updates on a partitioned table
I am working on implementing an incremental process on Hive table A.
Table A is already created in Hive, is partitioned on the YearMonth (YYYYMM) column, and holds the full volume.
We plan to keep importing updates/inserts from the source and capture them in a Hive delta table.
As shown in the figure below, the delta table indicates that the new updates relate to partitions 201804 / 201611 / 201705.
For the incremental process, I am planning to:
Select the 3 affected partitions from the original table.
Insert into delta2 select YYYYMM from Table where YYYYMM in (select distinct YYYYMM from Delta)
Merge these 3 partitions from the delta table with the corresponding partitions from the original table. (I can follow the Hortonworks 4-step strategy to apply the updates.)
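Merge Delta2 + Delta := new 3 partitions.
Delete the 3 partitions from the original table.
Alter Table Drop partitions 201804 / 201611 / 201705
Add the newly merged partitions, which carry the new updates, back to the original table.
I need to automate this script. Can you suggest how to put the above logic into HiveQL or Spark, specifically identifying the partitions and dropping them from the original table?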
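You can build a solution using PySpark. I will explain the approach with some basic examples; you can adapt it to your business requirements. Suppose you have a partitioned table in Hive defined as below.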
CREATE TABLE IF NOT EXISTS udb.emp_partition_Load_tbl (
emp_id smallint
,emp_name VARCHAR(30)
,emp_city VARCHAR(10)
,emp_dept VARCHAR(30)
,emp_salary BIGINT
)
PARTITIONED BY (Year String, Month String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS ORC;
You receive some CSV files containing input records that you want to load into the partitioned table.
1|vikrant singh rana|Gurgaon|Information Technology|20000
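from pyspark.sql.functions import lit

# userschema is assumed to be a StructType describing the five input columns
# (emp-id, emp-name, emp-city, emp-department, emp-salary); its definition is not shown here.
dataframe = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "false") \
    .option("inferschema", "true") \
    .schema(userschema) \
    .option("delimiter", "|").load("file:///filelocation/userinput")

# Tag every input record with its target partition (year=2018, month=01)
newdf = dataframe.withColumn('year', lit('2018')).withColumn('month', lit('01'))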
+------+------------------+--------+----------------------+----------+----+-----+
|emp-id|emp-name          |emp-city|emp-department        |emp-salary|year|month|
+------+------------------+--------+----------------------+----------+----+-----+
|1     |vikrant singh rana|Gurgaon |Information Technology|20000     |2018|01   |
+------+------------------+--------+----------------------+----------+----+-----+
Set the following property so that only the data of the specific partitions being written is overwritten.
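The property itself is not shown here. A minimal sketch, assuming Spark 2.3 or later, where `spark.sql.sources.partitionOverwriteMode` controls whether an overwrite replaces every partition or only the ones present in the DataFrame. The helper name `enable_dynamic_partition_overwrite` is made up for illustration, and `spark` is assumed to be an existing SparkSession with Hive support. The two `hive.exec.*` settings are the dynamic-partition options mentioned further down in this thread. How much each setting matters can depend on whether Spark treats the table as a Hive SerDe table or converts it to its native ORC reader, so treat this as a starting point.

```python
def enable_dynamic_partition_overwrite(spark):
    """Configure the session so an overwrite touches only the partitions in the data."""
    # Assumption: Spark 2.3+. The default value is "static".
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Let both partition columns (year, month) be resolved from the data itself.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
```

Call it once, right after creating the session and before any `insertInto` write.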
Suppose you receive another set of data and want to insert it into a different partition.
+------+--------+--------+--------------+----------+----+-----+
|emp-id|emp-name|emp-city|emp-department|emp-salary|year|month|
+------+--------+--------+--------------+----------+----+-----+
|     2|     ABC| Gurgaon|HUMAN RESOURCE|     10000|2018|   02|
+------+--------+--------+--------------+----------+----+-----+
newdf.write.format('orc').mode("overwrite").insertInto('udb.emp_partition_Load_tbl')
> show partitions udb.emp_partition_Load_tbl;
+---------------------+--+
|      partition      |
+---------------------+--+
| year=2018/month=01  |
| year=2018/month=02  |
+---------------------+--+
Suppose you have another set of records that belongs to an existing partition.
3|XYZ|Gurgaon|HUMAN RESOURCE|80000
newdf = dataframe.withColumn('year', lit('2018')).withColumn('month',lit('02'))
+------+--------+--------+--------------+----------+----+-----+
|emp-id|emp-name|emp-city|emp-department|emp-salary|year|month|
+------+--------+--------+--------------+----------+----+-----+
|     3|     XYZ| Gurgaon|HUMAN RESOURCE|     80000|2018|   02|
+------+--------+--------+--------------+----------+----+-----+
# Append, so the rows already in partition year=2018/month=02 are kept (the query below returns both)
newdf.write.format('orc').mode("append").insertInto('udb.emp_partition_Load_tbl')
select * from udb.emp_partition_Load_tbl where year ='2018' and month ='02';
+---------+-----------+-----------+-----------------+-------------+-------+--------+--+
| emp_id  | emp_name  | emp_city  |    emp_dept     | emp_salary  | year  | month  |
+---------+-----------+-----------+-----------------+-------------+-------+--------+--+
| 3       | XYZ       | Gurgaon   | HUMAN RESOURCE  | 80000       | 2018  | 02     |
| 2       | ABC       | Gurgaon   | HUMAN RESOURCE  | 10000       | 2018  | 02     |
+---------+-----------+-----------+-----------------+-------------+-------+--------+--+
As you can see below, the data in the other partition was not touched.
> select * from udb.emp_partition_Load_tbl where year ='2018' and month ='01';
+---------+---------------------+-----------+-------------------------+-------------+-------+--------+--+
| emp_id  |      emp_name       | emp_city  |        emp_dept         | emp_salary  | year  | month  |
+---------+---------------------+-----------+-------------------------+-------------+-------+--------+--+
| 1       | vikrant singh rana  | Gurgaon   | Information Technology  | 20000       | 2018  | 01     |
+---------+---------------------+-----------+-------------------------+-------------+-------+--------+--+
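If you would rather keep the drop-and-re-add steps from the question, the partition identification and drop can be scripted. A rough sketch, assuming the partition column is named `yyyymm` and holds string values (both assumptions; the question only calls it the YearMonth YYYYMM column). The helper `build_drop_partition_statements` is hypothetical and only generates HiveQL text.

```python
def build_drop_partition_statements(table, partition_col, partition_values):
    """Return one ALTER TABLE ... DROP PARTITION statement per distinct partition value."""
    statements = []
    # De-duplicate and sort so the generated script is stable between runs.
    for value in sorted(set(partition_values)):
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_col}='{value}')"
        )
    return statements


# Partition values as they might come back from the delta table (duplicates included).
affected = ["201804", "201611", "201705", "201804"]
for stmt in build_drop_partition_statements("A", "yyyymm", affected):
    print(stmt)
```

In PySpark the list of values could come from something like `spark.sql("SELECT DISTINCT yyyymm FROM Delta").collect()`, and each generated statement could then be passed to `spark.sql(...)`.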
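Do it in a single statement: you can use WHERE partition_col IN (SELECT DISTINCT partition_col FROM Delta) to restrict which partitions of the main table get overwritten. Use dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;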
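One way that single statement might look, written as a small Python helper that only builds the HiveQL text. Everything here is an assumption made for illustration: the helper name `build_merge_overwrite_sql`, the column names `id`, `name`, `amount` and `yyyymm`, the idea that the delta table has the same columns as the main table, and the rule that a delta row replaces an existing row with the same key. It also assumes the delta holds at most one row per key and that a record never moves to a different partition.

```python
def build_merge_overwrite_sql(main_table, delta_table, columns, key_cols, partition_col):
    """Build one INSERT OVERWRITE that rewrites only the partitions present in the delta."""
    cols = ", ".join(columns)   # non-partition columns, in table order
    keys = ", ".join(key_cols)  # columns that identify a record
    # src_rank 1 = delta row, 2 = existing row; ROW_NUMBER keeps the delta row on a key clash.
    # The partition column comes last in the SELECT, as dynamic partition inserts require.
    return (
        f"INSERT OVERWRITE TABLE {main_table} PARTITION ({partition_col})\n"
        f"SELECT {cols}, {partition_col}\n"
        f"FROM (\n"
        f"  SELECT u.*, ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY u.src_rank) AS rn\n"
        f"  FROM (\n"
        f"    SELECT {cols}, {partition_col}, 1 AS src_rank FROM {delta_table}\n"
        f"    UNION ALL\n"
        f"    SELECT {cols}, {partition_col}, 2 AS src_rank FROM {main_table}\n"
        f"    WHERE {partition_col} IN (SELECT DISTINCT {partition_col} FROM {delta_table})\n"
        f"  ) u\n"
        f") ranked\n"
        f"WHERE rn = 1"
    )


print(build_merge_overwrite_sql("A", "Delta", ["id", "name", "amount"], ["id"], "yyyymm"))
```

The generated statement is meant to be run in Hive with the two dynamic-partition settings above. Depending on the table type, Spark may refuse to overwrite a table that the same query reads from, so test it on a copy first.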