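Converting an arbitrary, multiply nested JSON structure into key and value fields
Tags: json, tsql, azure-data-factory, databricks

I have been asked to build an ETL pipeline in Azure. This pipeline should:
- read ORC files that the vendor delivers to ADLS
- parse the PARAMS field, which stores a JSON structure inside the ORC structure, and add it to the output as two new fields (KEY, VALUE)
- write the output to an Azure SQL database

The problem is that different types of records use different JSON structures. I don't want to write a custom expression for each class of JSON structure (there may be hundreds of them). Instead, I am looking for a generic mechanism that can parse them regardless of the type of the input JSON structure.

Currently, to meet this requirement, I am using the ADF built-in connector for ORC. The query used in the current design: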
SELECT uuid,
AttrName = a1.[key] +
COALESCE('.' + a2.[key], '') +
COALESCE('.' + a3.[key], '') +
COALESCE('.' + a4.[key], ''),
AttrValue = COALESCE(a4.value, a3.value, a2.value, a1.value)
FROM ORC.EventsSnapshot_RawData
OUTER APPLY OPENJSON(params) a1
OUTER APPLY
(
SELECT [key],
value,
type
FROM OPENJSON(a1.value)
WHERE ISJSON(a1.value) = 1
) a2
OUTER APPLY
(
SELECT [key],
value,
type
FROM OPENJSON(a2.value)
WHERE ISJSON(a2.value) = 1
) a3
OUTER APPLY
(
SELECT [key],
value,
type
FROM OPENJSON(a3.value)
WHERE ISJSON(a3.value) = 1
) a4
sp_executesql
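Unfortunately, this approach is very inefficient in terms of execution time: for 11 million records it takes 3.5 hours to complete.
Someone suggested that I use Databricks. OK, so I started with this: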
orcfile = "/mnt/adls/.../Input/*.orc"
eventDf = spark.read.orc(orcfile)
#spark.sql("drop table if exists ORC.Events_RawData")
eventDf.write.mode("overwrite").saveAsTable("ORC.Events_Raw")
Sample 1
{
"correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
"eventCreateTime": "2020-05-12T15:38:23.717Z",
"time": 1589297903717,
"owner": {
"ownergeography": {
"city": "abc",
"country": "abc"
},
"ownername": {
"firstname": "abc",
"lastname": "def"
},
"clientApiKey": "xxxxx",
"businessProfileApiKey": null,
"userId": null
},
"campaignType": "Mobile push"
}
Sample 2
{
"correlationIds": [
{
"campaignId": "iXyS4z811Rax",
"correlationId": "b316233807ac68675f37787f5dd83871"
}
],
"variantId": 1278915,
"utmCampaign": "",
"ua.os.major": "8"
}
Sample 3
{
"correlationId": "ls7XmuuiThWzktUeewqgWg",
"eventCreateTime": "2020-05-12T12:40:20.786Z",
"time": 1589287220786,
"modifiedBy": {
"clientId": null,
"clientApiKey": "xxx",
"businessProfileApiKey": null,
"userId": null
},
"campaignType": "Mobile push"
}
Sample expected output (Spark DataFrame)
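As a rough illustration of the KEY/VALUE shape the pipeline is after, here is a minimal plain-Python sketch that has no fixed depth limit, unlike the four chained OUTER APPLY levels in the T-SQL query. Everything named here is an assumption made for illustration: the function `to_key_value_rows`, the dotted KEY format (borrowed from AttrName in the query above) and the `(uuid, KEY, VALUE)` tuple layout. It uses only the standard library and is not tied to Spark.

```python
import json


def to_key_value_rows(uuid, params):
    """Yield (uuid, KEY, VALUE) for every scalar in a JSON string.

    KEY is the dotted path to the scalar; array elements use their index.
    Hypothetical helper, not part of ADF, Databricks or SQL Server.
    """
    # Explicit stack instead of recursion, so nesting depth is unbounded.
    stack = [("", json.loads(params))]
    while stack:
        prefix, node = stack.pop()
        if isinstance(node, dict):
            items = list(node.items())
        elif isinstance(node, list):
            items = [(str(i), v) for i, v in enumerate(node)]
        else:
            # Scalar (string, number, boolean or null): emit one row.
            yield (uuid, prefix, node)
            continue
        # Push children in reverse so they are popped in document order.
        for key, child in reversed(items):
            stack.append((f"{prefix}.{key}" if prefix else key, child))


# Usage with a made-up record; "some-uuid" is a placeholder value.
record = ("some-uuid", '{"owner": {"ownername": {"firstname": "abc"}}, "userId": null}')
for row in to_key_value_rows(*record):
    print(row)
```

Note that a key which itself contains dots, such as "ua.os.major" in Sample 2, cannot be told apart from a nested path in this KEY format, so a different separator may be needed in practice.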
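Well, here is your one-for-all approach :-)
First, we create a declared table variable and fill it with your samples to simulate your issue (please try to provide this yourself next time).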
DECLARE @table TABLE(ID INT IDENTITY, AnyJSON NVARCHAR(MAX));
INSERT INTO @table VALUES
(N'{
    "correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
    "eventCreateTime": "2020-05-12T15:38:23.717Z",
    "time": 1589297903717,
    "owner": {
        "ownergeography": {
            "city": "abc",
            "country": "abc"
        },
        "ownername": {
            "firstname": "abc",
            "lastname": "def"
        },
        "clientApiKey": "xxxxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}')
,(N'{
    "correlationIds": [
        {
            "campaignId": "iXyS4z811Rax",
            "correlationId": "b316233807ac68675f37787f5dd83871"
        }
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
}')
,(N'{
    "correlationId": "ls7XmuuiThWzktUeewqgWg",
    "eventCreateTime": "2020-05-12T12:40:20.786Z",
    "time": 1589287220786,
    "modifiedBy": {
        "clientId": null,
        "clientApiKey": "xxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}');
--the query
WITH recCTE AS
(
    -- anchor: one row per document, starting at the root
    SELECT ID
          ,CAST(1 AS BIGINT) AS ObjectIndex
          ,CAST(N'000' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) SortString
          ,1 AS NestLevel
          ,CAST(CONCAT(N'Root-',ID,'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonPath
          ,CAST(N'$' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonKey
          ,CAST(AnyJSON COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonValue
          ,CAST(CASE WHEN ISJSON(AnyJSON)=1 THEN AnyJSON COLLATE DATABASE_DEFAULT ELSE NULL END AS NVARCHAR(MAX)) AS NestedJSON
    FROM @table t

    UNION ALL

    -- recursion: open every fragment that is still valid JSON
    SELECT r.ID
          ,ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
          ,CAST(CONCAT(r.SortString,STR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),3)) AS NVARCHAR(MAX))
          ,r.NestLevel+1
          ,CAST(CONCAT(r.JsonPath,A.[key] + N'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,CAST(A.[key] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,r.JsonValue COLLATE DATABASE_DEFAULT
          ,CAST(A.[value] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
    FROM recCTE r
    CROSS APPLY OPENJSON(r.NestedJSON) A
    WHERE ISJSON(r.NestedJSON)=1
)
SELECT ID
      ,JsonPath
      ,JsonKey
      ,NestedJSON AS JsonValue
FROM recCTE
WHERE ISJSON(NestedJSON)=0
ORDER BY recCTE.ID,SortString;
The result
+----+----------------------------------------+-----------------+----------------------------------+
| ID | JsonPath                               | JsonKey         | JsonValue                        |
+----+----------------------------------------+-----------------+----------------------------------+
| 1  | Root-1.correlationId.                  | correlationId   | c3xOeEEQQCCA9sEx7-u6FA           |
| 1  | Root-1.eventCreateTime.                | eventCreateTime | 2020-05-12T15:38:23.717Z         |
| 1  | Root-1.time.                           | time            | 1589297903717                    |
| 1  | Root-1.owner.ownergeography.city.      | city            | abc                              |
| 1  | Root-1.owner.ownergeography.country.   | country         | abc                              |
| 1  | Root-1.owner.ownername.firstname.      | firstname       | abc                              |
| 1  | Root-1.owner.ownername.lastname.       | lastname        | def                              |
| 1  | Root-1.owner.clientApiKey.             | clientApiKey    | xxxxx                            |
| 1  | Root-1.campaignType.                   | campaignType    | Mobile push                      |
| 2  | Root-2.correlationIds.0.campaignId.    | campaignId      | iXyS4z811Rax                     |
| 2  | Root-2.correlationIds.0.correlationId. | correlationId   | b316233807ac68675f37787f5dd83871 |
| 2  | Root-2.variantId.                      | variantId       | 1278915                          |
| 2  | Root-2.utmCampaign.                    | utmCampaign     |                                  |
| 2  | Root-2.ua.os.major.                    | ua.os.major     | 8                                |
| 3  | Root-3.correlationId.                  | correlationId   | ls7XmuuiThWzktUeewqgWg           |
| 3  | Root-3.eventCreateTime.                | eventCreateTime | 2020-05-12T12:40:20.786Z         |
| 3  | Root-3.time.                           | time            | 1589287220786                    |
| 3  | Root-3.modifiedBy.clientApiKey.        | clientApiKey    | xxx                              |
| 3  | Root-3.campaignType.                   | campaignType    | Mobile push                      |
+----+----------------------------------------+-----------------+----------------------------------+
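The idea in short:
- We use a recursive CTE to walk down the structure.
- The query tests whether each fragment ([value] coming from OPENJSON) is valid JSON.
- If the fragment is valid JSON, the recursion walks deeper and deeper.
- The column SortString is needed to get the final sort order.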
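To make the recursion easier to follow outside SQL Server, here is a minimal Python sketch of the same idea. It is not a port of the query. The function name `flatten_like_cte` and the tuple layout are assumptions made for illustration. It only descends into objects and arrays that are already parsed, so a string that happens to contain JSON text is treated as a plain value. It also skips nulls, because the result above contains no rows for them.

```python
import json


def flatten_like_cte(doc_id, json_text):
    """Return (ID, JsonPath, JsonKey, JsonValue) rows for one JSON document.

    Hypothetical helper that mimics the columns of the recursive CTE.
    """
    rows = []

    def walk(node, path, key, sort_string):
        if isinstance(node, dict):
            children = list(node.items())
        elif isinstance(node, list):
            # Array elements get their index as key, as OPENJSON does.
            children = [(str(i), v) for i, v in enumerate(node)]
        else:
            # Leaf: keep it unless it is null.
            if node is not None:
                rows.append((sort_string, (doc_id, path, key, node)))
            return
        for position, (child_key, child) in enumerate(children, start=1):
            # Like STR(n, 3): right-align the position in three characters
            # and append it to the parent's sort string.
            walk(child, f"{path}{child_key}.", child_key,
                 sort_string + f"{position:>3}")

    walk(json.loads(json_text), f"Root-{doc_id}.", "$", "000")
    # Sorting by the accumulated string restores document order.
    rows.sort(key=lambda item: item[0])
    return [row for _, row in rows]


# Usage with the second sample document.
sample_2 = """{
    "correlationIds": [
        {"campaignId": "iXyS4z811Rax",
         "correlationId": "b316233807ac68675f37787f5dd83871"}
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
}"""
for row in flatten_like_cte(2, sample_2):
    print(row)
```

Because the walk is already depth-first, the final sort does not change anything in Python. It is kept only to show what SortString is for in the set-based query, where row order is not guaranteed without an ORDER BY.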