Problem filtering a Hive map column on a combination of key values (AND condition) in Spark SQL
I have a Hive table with a map-type column, key, which stores its values as key-value pairs. I need to write a filter condition that combines two of the map's values. Sample dataset:
+---------------+--------------+----------------------+
| column_value | metric_name | key |
+---------------+--------------+----------------------+
| A37B | Mean | {0:"202006",1:"1"} |
| ACCOUNT_ID | Mean | {0:"202006",1:"2"} |
| ANB_200 | Mean | {0:"202006",1:"3"} |
| ANB_201 | Mean | {0:"202006",1:"4"} |
| AS82_RE | Mean | {0:"202006",1:"5"} |
| ATTR001 | Mean | {0:"202007",1:"2"} |
| ATTR001_RE | Mean | {0:"202007",1:"3"} |
| ATTR002 | Mean | {0:"202007",1:"4"} |
| ATTR002_RE | Mean | {0:"202007",1:"5"} |
| ATTR003 | Mean | {0:"202008",1:"3"} |
| ATTR004 | Mean | {0:"202008",1:"4"} |
| ATTR005 | Mean | {0:"202008",1:"5"} |
| ATTR006 | Mean | {0:"202009",1:"4"} |
| ATTR006 | Mean | {0:"202009",1:"5"} |
I need to write a Spark SQL query that filters the key column by a range on the first value, while excluding rows that match a specific combination of both values:
select * from table where key[0] between 202006 and 202009 and key NOT IN (0:"202009",1:"5)
Expected output:
+---------------+--------------+----------------------+
| column_value | metric_name | key |
+---------------+--------------+----------------------+
| A37B | Mean | {0:"202006",1:"1"} |
| ACCOUNT_ID | Mean | {0:"202006",1:"2"} |
| ANB_200 | Mean | {0:"202006",1:"3"} |
| ANB_201 | Mean | {0:"202006",1:"4"} |
| AS82_RE | Mean | {0:"202006",1:"5"} |
| ATTR001 | Mean | {0:"202007",1:"2"} |
| ATTR001_RE | Mean | {0:"202007",1:"3"} |
| ATTR002 | Mean | {0:"202007",1:"4"} |
| ATTR002_RE | Mean | {0:"202007",1:"5"} |
| ATTR003 | Mean | {0:"202008",1:"3"} |
| ATTR004 | Mean | {0:"202008",1:"4"} |
| ATTR005 | Mean | {0:"202008",1:"5"} |
| ATTR006 | Mean | {0:"202009",1:"4"} |
Check the code below.

Using Spark Scala:

Create a schema matching the key column values:
scala> import org.apache.spark.sql.types._
scala> val schema = DataType
.fromJson("""{"type":"struct","fields":[{"name":"0","type":"string","nullable":true,"metadata":{}},{"name":"1","type":"string","nullable":true,"metadata":{}}]}""")
.asInstanceOf[StructType]
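For reference, the same two-field schema can be built directly with StructType instead of parsing a JSON string (a sketch; this requires the spark-sql dependency on the classpath, e.g. inside spark-shell):

```scala
import org.apache.spark.sql.types._

// Same schema as the DataType.fromJson version above,
// built directly: struct<0:string,1:string>
val schema = StructType(Seq(
  StructField("0", StringType, nullable = true),
  StructField("1", StringType, nullable = true)
))
```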
Print the schema of the key column, then apply it to the key column of the DataFrame:
scala> :paste
// Convert key column values to valid json & then apply schema json.
df
.withColumn("key_new",
from_json(
regexp_replace(
regexp_replace(
$"key",
"0:",
"\"0\":"
),
"1:",
"\"1\":"
),
schema
)
)
.filter(
$"key_new.0".between(202006,202009) &&
!($"key_new.0" === 202009 && $"key_new.1" === 5)
).show(false)
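As a sanity check, the string rewrite that the two regexp_replace calls perform can be reproduced in plain Scala without Spark (toValidJson is a hypothetical helper name used only for illustration):

```scala
// Reproduces the rewrite done by the two regexp_replace calls above:
// {0:"202006",1:"1"}  becomes  {"0":"202006","1":"1"}  (valid JSON)
def toValidJson(raw: String): String =
  raw.replaceAll("0:", "\"0\":").replaceAll("1:", "\"1\":")

println(toValidJson("""{0:"202006",1:"1"}"""))
// prints {"0":"202006","1":"1"}
```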
The final output matches the expected output above. A pure Spark SQL version of the same approach is shown further below.
Another option, in Hive: use the map() function to turn the NOT IN argument into a map literal:
select * from your_data
where key[0] between 202006 and 202009
and key NOT IN ( map(0,"202009",1,"5") ); --can be many map() comma separated
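If the map-literal comparison is rejected by Spark SQL (Spark does not support equality comparisons on map-typed columns), the same condition can be expressed without comparing maps at all, as is also suggested in the comments below; a sketch assuming the table name your_data:

```sql
-- Equivalent filter that avoids comparing map values
-- (Spark SQL cannot compare map-typed columns for equality):
select * from your_data
where key[0] between 202006 and 202009
  and not (key[0] = 202009 and key[1] = 5);
```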
Comments:

"I am unable to perform this filter in Spark SQL." -- "What is the data type of the key column?" -- "The key column is of map data type." -- "This query works fine in HiveQL, but when I try it in Spark SQL it does not work. Need help." -- "@Arvinth if it is a map column, then instead of key NOT IN (map(0,"202009",1,"5")) try NOT (key[0] = 202009 AND key[1] = 5)."

Using Spark SQL, the same approach works with get_json_object:
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.sql("""
WITH table_data AS (
SELECT
column_value,
metric_name,
key,
get_json_object(replace(replace(key,'0:','\"0\":'),'1:','\"1\":'),'$.0') as k,
get_json_object(replace(replace(key,'0:','\"0\":'),'1:','\"1\":'),'$.1') as v
FROM data
)
SELECT
column_value,
metric_name,
key,
k,
v
FROM table_data
WHERE
(k between 202006 and 202009) AND
NOT (k = 202009 AND v = 5)
""").show(false)
// Exiting paste mode, now interpreting.
+------------+-----------+------------------+------+---+
|column_value|metric_name|key |k |v |
+------------+-----------+------------------+------+---+
|A37B |Mean |{0:"202006",1:"1"}|202006|1 |
|ACCOUNT_ID |Mean |{0:"202006",1:"2"}|202006|2 |
|ANB_200 |Mean |{0:"202006",1:"3"}|202006|3 |
|ANB_201 |Mean |{0:"202006",1:"4"}|202006|4 |
|AS82_RE |Mean |{0:"202006",1:"5"}|202006|5 |
|ATTR001 |Mean |{0:"202007",1:"2"}|202007|2 |
|ATTR001_RE |Mean |{0:"202007",1:"3"}|202007|3 |
|ATTR002 |Mean |{0:"202007",1:"4"}|202007|4 |
|ATTR002_RE |Mean |{0:"202007",1:"5"}|202007|5 |
|ATTR003 |Mean |{0:"202008",1:"3"}|202008|3 |
|ATTR004 |Mean |{0:"202008",1:"4"}|202008|4 |
|ATTR005 |Mean |{0:"202008",1:"5"}|202008|5 |
|ATTR006 |Mean |{0:"202009",1:"4"}|202009|4 |
+------------+-----------+------------------+------+---+
scala>