Problem filtering a Hive map column in Spark SQL by a combination of key values (AND condition)


I have a Hive table with a map-type column named key that stores its values as key-value pairs. I need to write a filter condition that combines two of the map's key values.

Sample dataset:

+--------------+-------------+---------------------+
| column_value | metric_name |         key         |
+--------------+-------------+---------------------+
| A37B         | Mean        | {0:"202006",1:"1"}  |
| ACCOUNT_ID   | Mean        | {0:"202006",1:"2"}  |
| ANB_200      | Mean        | {0:"202006",1:"3"}  |
| ANB_201      | Mean        | {0:"202006",1:"4"}  |
| AS82_RE      | Mean        | {0:"202006",1:"5"}  |
| ATTR001      | Mean        | {0:"202007",1:"2"}  |
| ATTR001_RE   | Mean        | {0:"202007",1:"3"}  |
| ATTR002      | Mean        | {0:"202007",1:"4"}  |
| ATTR002_RE   | Mean        | {0:"202007",1:"5"}  |
| ATTR003      | Mean        | {0:"202008",1:"3"}  |
| ATTR004      | Mean        | {0:"202008",1:"4"}  |
| ATTR005      | Mean        | {0:"202008",1:"5"}  |
| ATTR006      | Mean        | {0:"202009",1:"4"}  |
| ATTR006      | Mean        | {0:"202009",1:"5"}  |
+--------------+-------------+---------------------+

I need to write a Spark SQL query that filters on the key column but excludes rows matching a combination of two key values. My attempt:

select * from table where key[0] between 202006 and 202009 and key NOT IN (0:"202009",1:"5")
Expected output:

+--------------+-------------+---------------------+
| column_value | metric_name |         key         |
+--------------+-------------+---------------------+
| A37B         | Mean        | {0:"202006",1:"1"}  |
| ACCOUNT_ID   | Mean        | {0:"202006",1:"2"}  |
| ANB_200      | Mean        | {0:"202006",1:"3"}  |
| ANB_201      | Mean        | {0:"202006",1:"4"}  |
| AS82_RE      | Mean        | {0:"202006",1:"5"}  |
| ATTR001      | Mean        | {0:"202007",1:"2"}  |
| ATTR001_RE   | Mean        | {0:"202007",1:"3"}  |
| ATTR002      | Mean        | {0:"202007",1:"4"}  |
| ATTR002_RE   | Mean        | {0:"202007",1:"5"}  |
| ATTR003      | Mean        | {0:"202008",1:"3"}  |
| ATTR004      | Mean        | {0:"202008",1:"4"}  |
| ATTR005      | Mean        | {0:"202008",1:"5"}  |
| ATTR006      | Mean        | {0:"202009",1:"4"}  |
+--------------+-------------+---------------------+
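Stripped of Spark, the requested filter is a range check on key[0] plus the exclusion of one specific (key[0], key[1]) pair. A minimal Python sketch over a few of the sample rows:

```python
# Sketch of the intended filter logic; data copied from the sample dataset.
rows = [
    ("A37B", {0: "202006", 1: "1"}),
    ("ATTR006", {0: "202009", 1: "4"}),
    ("ATTR006", {0: "202009", 1: "5"}),  # the combination to exclude
]

kept = [name for name, key in rows
        if 202006 <= int(key[0]) <= 202009
        and not (key[0] == "202009" and key[1] == "5")]
print(kept)  # ['A37B', 'ATTR006']
```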

Check the code below.

Using Spark Scala

Create a schema matching the key column's values:

scala> import org.apache.spark.sql.types._

scala>  val schema = DataType
.fromJson("""{"type":"struct","fields":[{"name":"0","type":"string","nullable":true,"metadata":{}},{"name":"1","type":"string","nullable":true,"metadata":{}}]}""")
.asInstanceOf[StructType]

Print the schema of the key column, then apply the schema JSON to the key column of the DataFrame:

scala> :paste
// Convert key column values to valid json & then apply schema json.
df
.withColumn("key_new",
    from_json(
        regexp_replace(
            regexp_replace(
                $"key",
                "0:",
                "\"0\":"
            ),
            "1:",
            "\"1\":"
        ),
        schema
    )
)
.filter(
    $"key_new.0".between(202006,202009) &&
    !($"key_new.0" === 202009 && $"key_new.1" === 5)
).show(false)
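The heart of the trick is repairing the stored string into valid JSON before parsing it, which is what the two `regexp_replace` calls do. Outside Spark, the same repair can be sketched with Python's `re` and `json` modules (one generic regex here instead of two replacements):

```python
import json
import re

# The key column stores strings like {0:"202006",1:"1"}, which are not valid
# JSON because the numeric keys are unquoted. Quote them, then parse.
raw = '{0:"202006",1:"1"}'
valid = re.sub(r'(\d+):', r'"\1":', raw)  # -> {"0":"202006","1":"1"}
parsed = json.loads(valid)
print(parsed)  # {'0': '202006', '1': '1'}
```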

Final output: the expected rows shown above.

Using Spark SQL: see the `spark.sql` query further below.



Alternatively, convert the NOT IN arguments into map literals using the map function:

select * from your_data 
 where key[0] between  202006 and 202009 
   and key NOT IN ( map(0,"202009",1,"5") ); --can be many map() comma separated
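Here Hive compares the whole map value against each `map()` literal, so `NOT IN` drops a row only when its map equals one of the listed maps. A rough Python model of that semantics, with dicts standing in for Hive maps:

```python
# Rough model of Hive's NOT IN over map literals (dicts stand in for maps).
excluded = [{0: "202009", 1: "5"}]  # several map() literals -> several dicts

def keep(key):
    return 202006 <= int(key[0]) <= 202009 and key not in excluded

print(keep({0: "202009", 1: "4"}))  # True
print(keep({0: "202009", 1: "5"}))  # False
```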


Comments: "I can't perform this filter in Spark SQL." — "What is the data type of the key column?" — "The key column is a map type. This query works fine in HiveQL, but when I try it in Spark SQL it does not work. Need help." — "@Arvinth, if key is a map, try NOT (key[0] = 202009 AND key[1] = 5) instead of key NOT IN (map(0, \"202009\", 1, \"5\"))."
scala> :paste
// Entering paste mode (ctrl-D to finish)

spark.sql("""
WITH table_data AS (
    SELECT
        column_value,
        metric_name,
        key,
        get_json_object(replace(replace(key,'0:','\"0\":'),'1:','\"1\":'),'$.0') as k,
        get_json_object(replace(replace(key,'0:','\"0\":'),'1:','\"1\":'),'$.1') as v
    FROM data
)
SELECT
    column_value,
    metric_name,
    key,
    k,
    v
FROM table_data
WHERE
    (k BETWEEN 202006 AND 202009) AND
    NOT (k = 202009 AND v = 5)
""").show(false)

// Exiting paste mode, now interpreting.

+------------+-----------+------------------+------+---+
|column_value|metric_name|key               |k     |v  |
+------------+-----------+------------------+------+---+
|A37B        |Mean       |{0:"202006",1:"1"}|202006|1  |
|ACCOUNT_ID  |Mean       |{0:"202006",1:"2"}|202006|2  |
|ANB_200     |Mean       |{0:"202006",1:"3"}|202006|3  |
|ANB_201     |Mean       |{0:"202006",1:"4"}|202006|4  |
|AS82_RE     |Mean       |{0:"202006",1:"5"}|202006|5  |
|ATTR001     |Mean       |{0:"202007",1:"2"}|202007|2  |
|ATTR001_RE  |Mean       |{0:"202007",1:"3"}|202007|3  |
|ATTR002     |Mean       |{0:"202007",1:"4"}|202007|4  |
|ATTR002_RE  |Mean       |{0:"202007",1:"5"}|202007|5  |
|ATTR003     |Mean       |{0:"202008",1:"3"}|202008|3  |
|ATTR004     |Mean       |{0:"202008",1:"4"}|202008|4  |
|ATTR005     |Mean       |{0:"202008",1:"5"}|202008|5  |
|ATTR006     |Mean       |{0:"202009",1:"4"}|202009|4  |
+------------+-----------+------------------+------+---+
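The CTE's logic can be checked in plain Python as well: repair the string, pull the two fields the way `get_json_object` does with the `$.0` and `$.1` paths, and apply the WHERE clause (a sketch, not Spark code):

```python
import json
import re

def extract(raw, field):
    # Quote the bare numeric keys so the string parses as JSON, then pull one
    # top-level field -- the role get_json_object plays in the CTE above.
    return json.loads(re.sub(r'(\d+):', r'"\1":', raw)).get(field)

def where(raw):
    k, v = extract(raw, "0"), extract(raw, "1")
    return 202006 <= int(k) <= 202009 and not (k == "202009" and v == "5")

print(where('{0:"202009",1:"4"}'))  # True
print(where('{0:"202009",1:"5"}'))  # False
```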

