How to copy CosmosDB documents to Blob storage (each document in a single JSON file) with Azure Data Factory
I am trying to back up my Cosmos DB storage using Azure Data Factory v2. In general it does the job, but I want each document in a Cosmos collection to correspond to a new JSON file in Blob storage. With the copy parameters below, I can copy all documents from a collection into a single file in Azure Blob storage:
{
  "name": "ForEach_mih",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.cw_items",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Copy_mih",
        "type": "Copy",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false
        },
        "userProperties": [
          {
            "name": "Source",
            "value": "@{item().source.collectionName}"
          },
          {
            "name": "Destination",
            "value": "cosmos-backup-v2/@{item().destination.fileName}"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "DocumentDbCollectionSource",
            "nestingSeparator": "."
          },
          "sink": {
            "type": "BlobSink"
          },
          "enableStaging": false,
          "enableSkipIncompatibleRow": true,
          "redirectIncompatibleRowSettings": {
            "linkedServiceName": {
              "referenceName": "Clear_Test_BlobStorage",
              "type": "LinkedServiceReference"
            },
            "path": "cosmos-backup-logs"
          },
          "cloudDataMovementUnits": 0
        },
        "inputs": [
          {
            "referenceName": "SourceDataset_mih",
            "type": "DatasetReference",
            "parameters": {
              "cw_collectionName": "@item().source.collectionName"
            }
          }
        ],
        "outputs": [
          {
            "referenceName": "DestinationDataset_mih",
            "type": "DatasetReference",
            "parameters": {
              "cw_fileName": "@item().destination.fileName"
            }
          }
        ]
      }
    ]
  }
}
How can I copy each Cosmos document into a separate file, named {PartitionId}-{docId}?
UPDATE
Source dataset code:
{
  "name": "ClustersData",
  "properties": {
    "linkedServiceName": {
      "referenceName": "Clear_Test_CosmosDb",
      "type": "LinkedServiceReference"
    },
    "type": "DocumentDbCollection",
    "typeProperties": {
      "collectionName": "directory-clusters"
    }
  }
}
Destination dataset code:
{
  "name": "OutputClusters",
  "properties": {
    "linkedServiceName": {
      "referenceName": "Clear_Test_BlobStorage",
      "type": "LinkedServiceReference"
    },
    "type": "AzureBlob",
    "typeProperties": {
      "format": {
        "type": "JsonFormat",
        "filePattern": "arrayOfObjects"
      },
      "fileName": "",
      "folderPath": "cosmos-backup-logs"
    }
  }
}
Pipeline code:
{
  "name": "copy-clsts",
  "properties": {
    "activities": [
      {
        "name": "LookupClst",
        "type": "Lookup",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false
        },
        "typeProperties": {
          "source": {
            "type": "DocumentDbCollectionSource",
            "nestingSeparator": "."
          },
          "dataset": {
            "referenceName": "ClustersData",
            "type": "DatasetReference"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachClst",
        "type": "ForEach",
        "dependsOn": [
          {
            "activity": "LookupClst",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupClst').output.value",
            "type": "Expression"
          },
          "batchCount": 8,
          "activities": [
            {
              "name": "CpyClst",
              "type": "Copy",
              "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false
              },
              "typeProperties": {
                "source": {
                  "type": "DocumentDbCollectionSource",
                  "query": "select @{item()}",
                  "nestingSeparator": "."
                },
                "sink": {
                  "type": "BlobSink"
                },
                "enableStaging": false,
                "enableSkipIncompatibleRow": true,
                "cloudDataMovementUnits": 0
              },
              "inputs": [
                {
                  "referenceName": "ClustersData",
                  "type": "DatasetReference"
                }
              ],
              "outputs": [
                {
                  "referenceName": "OutputClusters",
                  "type": "DatasetReference"
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
Example of a document from the input collection (all documents in the collection have the same format):
{
  "$type": "Entities.ADCluster",
  "DisplayName": "TESTNetBIOS",
  "OrgId": "9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
  "ClusterId": "ab2a242d-f1a5-62ed-b420-31b52e958586",
  "AllowLdapLifeCycleSynchronization": true,
  "DirectoryServers": [
    {
      "$type": "Entities.DirectoryServer",
      "AddressId": "e6a8edbb-ad56-4135-94af-fab50b774256",
      "Port": 389,
      "Host": "192.168.342.234"
    }
  ],
  "DomainNames": [
    "TESTNetBIOS"
  ],
  "BaseDn": null,
  "UseSsl": false,
  "RepositoryType": 1,
  "DirectoryCustomizations": null,
  "_etag": "\"140046f2-0000-0000-0000-5ac63a180000\"",
  "LastUpdateTime": "2018-04-05T15:00:40.243Z",
  "id": "ab2a242d-f1a5-62ed-b420-31b52e958586",
  "PartitionKey": "directory-clusters-9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
  "_rid": "kpvxLAs6gkmsCQAAAAAAAA==",
  "_self": "dbs/kvpxAA==/colls/kpvxLAs6gkk=/docs/kvpxALs6kgmsCQAAAAAAAA==/",
  "_attachments": "attachments/",
  "_ts": 1522940440
}
Take just one collection as an example. Inside the foreach, the Lookup and Copy activity source datasets reference the same Cosmos DB dataset. If you want to copy your 5 collections, you can put this pipeline into an Execute Pipeline activity, and have the master pipeline wrap that Execute Pipeline activity in a ForEach activity.

Because your Cosmos DB documents contain arrays, and ADF does not support serializing arrays from Cosmos DB, this is the workaround I can offer.

First, export all of the documents as JSON files, exporting the JSON as-is to Blob storage, ADLS, a file system, or any file store. I think you already know how to do that. This way, each collection will have one JSON file.

Second, process each JSON file, writing each row in the file out to its own single file.

I only provide the pipeline for step 2. You can use an Execute Pipeline activity to chain step 1 and step 2, and you can even process all the collections in step 2 with a ForEach activity.

Pipeline JSON:
{
  "name": "pipeline27",
  "properties": {
    "activities": [
      {
        "name": "Lookup1",
        "type": "Lookup",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false
        },
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "recursive": true
          },
          "dataset": {
            "referenceName": "AzureBlob7",
            "type": "DatasetReference"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEach1",
        "type": "ForEach",
        "dependsOn": [
          {
            "activity": "Lookup1",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('Lookup1').output.value",
            "type": "Expression"
          },
          "activities": [
            {
              "name": "Copy1",
              "type": "Copy",
              "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false
              },
              "typeProperties": {
                "source": {
                  "type": "DocumentDbCollectionSource",
                  "query": {
                    "value": "select @{item()}",
                    "type": "Expression"
                  },
                  "nestingSeparator": "."
                },
                "sink": {
                  "type": "BlobSink"
                },
                "enableStaging": false,
                "cloudDataMovementUnits": 0
              },
              "inputs": [
                {
                  "referenceName": "DocumentDbCollection1",
                  "type": "DatasetReference"
                }
              ],
              "outputs": [
                {
                  "referenceName": "AzureBlob6",
                  "type": "DatasetReference",
                  "parameters": {
                    "id": {
                      "value": "@item().id",
                      "type": "Expression"
                    },
                    "PartitionKey": {
                      "value": "@item().PartitionKey",
                      "type": "Expression"
                    }
                  }
                }
              ]
            }
          ]
        }
      }
    ]
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
Dataset JSON used for the lookup:
{
  "name": "AzureBlob7",
  "properties": {
    "linkedServiceName": {
      "referenceName": "bloblinkedservice",
      "type": "LinkedServiceReference"
    },
    "type": "AzureBlob",
    "typeProperties": {
      "format": {
        "type": "JsonFormat",
        "filePattern": "arrayOfObjects"
      },
      "fileName": "cosmos.json",
      "folderPath": "aaa"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}
Source dataset for the copy. Actually, this dataset is not really used for anything; it only exists to host the query select @{item()}:
{
  "name": "DocumentDbCollection1",
  "properties": {
    "linkedServiceName": {
      "referenceName": "CosmosDB-r8c",
      "type": "LinkedServiceReference"
    },
    "type": "DocumentDbCollection",
    "typeProperties": {
      "collectionName": "test"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}
Destination dataset. With its two parameters, it also addresses your file-naming request:
{
  "name": "AzureBlob6",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureStorage-eastus",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "id": {
        "type": "String"
      },
      "PartitionKey": {
        "type": "String"
      }
    },
    "type": "AzureBlob",
    "typeProperties": {
      "format": {
        "type": "JsonFormat",
        "filePattern": "setOfObjects"
      },
      "fileName": {
        "value": "@{dataset().PartitionKey}-@{dataset().id}.json",
        "type": "Expression"
      },
      "folderPath": "aaacosmos"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}
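For comparison, step 2 of this workaround (fanning one exported collection file out into per-document files) can also be sketched outside ADF. This is a minimal Python illustration, not part of the pipelines above; the file names and folder layout are assumptions:

```python
import json
from pathlib import Path

def split_export(export_file: str, out_dir: str) -> list:
    """Read a JSON array exported as-is from one Cosmos DB collection
    and write each document to its own {PartitionKey}-{id}.json file,
    mirroring the naming scheme requested in the question."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    docs = json.loads(Path(export_file).read_text(encoding="utf-8"))
    for doc in docs:
        name = f"{doc['PartitionKey']}-{doc['id']}.json"
        # Dump the document unchanged, so nested arrays survive intact
        (out / name).write_text(json.dumps(doc, indent=2), encoding="utf-8")
        written.append(name)
    return written
```

This sidesteps ADF's tabular mapping entirely, which is why nested arrays such as DirectoryServers are preserved.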
Also please note the limitations of the Lookup activity:
The following data sources are supported for lookup. The maximum number of rows the Lookup activity can return is 5000, up to a maximum size of 2 MB. Currently, the maximum duration of a Lookup activity before timeout is one hour.

Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving large amounts of data from one place to another, and it only generates one file per collection.
You could consider having an Azure Function trigger whenever a document is added or updated in your collection, and have the function write the document out to Blob storage. This should scale well and would be relatively easy to implement.
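To sketch that Azure Functions idea: the trigger and Blob bindings live in the function's configuration, so only the per-document mapping is shown here. This helper and its names are illustrative assumptions, not an existing API:

```python
import json

def doc_to_blob(doc: dict) -> tuple:
    """Map one Cosmos DB document (as a change-feed trigger would
    deliver it) to a (blob_name, payload) pair for Blob storage,
    using the {PartitionKey}-{id} naming scheme from the question."""
    blob_name = f"{doc['PartitionKey']}-{doc['id']}.json"
    # Serialize the document as-is; arrays are not flattened
    payload = json.dumps(doc, indent=2)
    return blob_name, payload
```

The function body would then upload `payload` under `blob_name` for each changed document, keeping the backup continuously up to date instead of running as a batch.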
I also struggled with this a bit, especially with getting around the size limits of the Lookup activity, since we had a LOT of data to migrate. I ended up creating a JSON file with a list of timestamp ranges to query the Cosmos data with; then for each of those ranges, I fetch the document IDs in that range, and then for each ID, fetch the full document data and save it to a path such as PartitionKey/DocumentID. Here are the pipelines I created:

LookupTimestamps - loops through each timestamp range from the times.json file, and for each range, executes the ExportFromCosmos pipeline:
{
  "name": "LookupTimestamps",
  "properties": {
    "activities": [
      {
        "name": "LookupTimestamps",
        "type": "Lookup",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "recursive": false
          },
          "dataset": {
            "referenceName": "BlobStorageTimestamps",
            "type": "DatasetReference"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTimestamp",
        "type": "ForEach",
        "dependsOn": [
          {
            "activity": "LookupTimestamps",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupTimestamps').output.value",
            "type": "Expression"
          },
          "isSequential": false,
          "activities": [
            {
              "name": "Execute Pipeline1",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": {
                  "referenceName": "ExportFromCosmos",
                  "type": "PipelineReference"
                },
                "waitOnCompletion": true,
                "parameters": {
                  "From": {
                    "value": "@{item().From}",
                    "type": "Expression"
                  },
                  "To": {
                    "value": "@{item().To}",
                    "type": "Expression"
                  }
                }
              }
            }
          ]
        }
      }
    ]
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
ExportFromCosmos - the nested pipeline that is executed from the pipeline above. This works around the fact that you can't have nested ForEach activities.
{
  "name": "ExportFromCosmos",
  "properties": {
    "activities": [
      {
        "name": "LookupDocuments",
        "type": "Lookup",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "typeProperties": {
          "source": {
            "type": "DocumentDbCollectionSource",
            "query": {
              "value": "select c.id, c.partitionKey from c where c._ts >= @{pipeline().parameters.from} and c._ts <= @{pipeline().parameters.to} order by c._ts desc",
              "type": "Expression"
            },
            "nestingSeparator": "."
          },
          "dataset": {
            "referenceName": "CosmosDb",
            "type": "DatasetReference"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachDocument",
        "type": "ForEach",
        "dependsOn": [
          {
            "activity": "LookupDocuments",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupDocuments').output.value",
            "type": "Expression"
          },
          "activities": [
            {
              "name": "Copy1",
              "type": "Copy",
              "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
              },
              "typeProperties": {
                "source": {
                  "type": "DocumentDbCollectionSource",
                  "query": {
                    "value": "select * from c where c.id = \"@{item().id}\" and c.partitionKey = \"@{item().partitionKey}\"",
                    "type": "Expression"
                  },
                  "nestingSeparator": "."
                },
                "sink": {
                  "type": "BlobSink"
                },
                "enableStaging": false
              },
              "inputs": [
                {
                  "referenceName": "CosmosDb",
                  "type": "DatasetReference"
                }
              ],
              "outputs": [
                {
                  "referenceName": "BlobStorageDocuments",
                  "type": "DatasetReference",
                  "parameters": {
                    "id": {
                      "value": "@item().id",
                      "type": "Expression"
                    },
                    "partitionKey": {
                      "value": "@item().partitionKey",
                      "type": "Expression"
                    }
                  }
                }
              ]
            }
          ]
        }
      }
    ],
    "parameters": {
      "from": {
        "type": "int"
      },
      "to": {
        "type": "int"
      }
    }
  }
}
BlobStorageDocuments - the dataset used for saving the documents:
{
  "name": "BlobStorageDocuments",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureBlobStorage1",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "id": {
        "type": "string"
      },
      "partitionKey": {
        "type": "string"
      }
    },
    "type": "AzureBlob",
    "typeProperties": {
      "format": {
        "type": "JsonFormat",
        "filePattern": "arrayOfObjects"
      },
      "fileName": {
        "value": "@{dataset().partitionKey}/@{dataset().id}.json",
        "type": "Expression"
      },
      "folderPath": "mycollection"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}
The times.json file is simply a list of epoch times, which looks like this:
[
  {
    "From": 1556150400,
    "To": 1556236799
  },
  {
    "From": 1556236800,
    "To": 1556323199
  }
]
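The ranges above are consecutive UTC days, each To being one second before the next From. A small script can generate such a file for any span of days; this helper is an illustrative assumption, not part of the original answer:

```python
import json
from datetime import datetime, timedelta, timezone

def make_time_ranges(start: str, days: int) -> list:
    """Build a times.json-style list of daily epoch ranges starting at
    UTC midnight of `start` (ISO date), one entry per day."""
    day = datetime.fromisoformat(start).replace(tzinfo=timezone.utc)
    ranges = []
    for _ in range(days):
        frm = int(day.timestamp())
        day += timedelta(days=1)
        # End one second before the next range begins, so ranges never overlap
        ranges.append({"From": frm, "To": int(day.timestamp()) - 1})
    return ranges

# Two days starting 2019-04-25 reproduce the example above
print(json.dumps(make_time_ranges("2019-04-25", 2), indent=2))
```

Shrinking the window (e.g. hourly ranges) is how you keep each Lookup under the 5000-row / 2 MB limit on very dense collections.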
Comments:

From your payload, I can see it was generated by the Copy Data tool. That pipeline should already copy each Cosmos document into a separate file in the same container. Could you check again? The only gap should be naming the files {PartitionId}-{docId}?

I did generate it with the Copy Data tool, but it merges all the documents into one file. I checked my Blob storage: 5 files for 5 collections, no more, no less. P.S. I also tried the ForEach approach mentioned in that post, but it didn't work either: when I tried to create a Copy activity inside the ForEach activity, I could not select @item as the source of the Copy activity from the ForEach.

Oh, sorry. I misunderstood collections as being files. Currently the tool is designed to produce one file per collection; I'm not sure your goal is supported. I think the answer in the post you mentioned should work. I'll write it up.

Thank you very much for your effort, I'll try it right now. I got an exception: Column: DirectoryServers, Data type Newtonsoft.Json.Linq.JArray is not supported. Do you know how I can fix this? Actually, I don't want the copy tool to parse my documents at all; I just want them kept as-is. But when I check "Binary copy", it says "Binary copy is not allowed between file-based stores and tabular datasets".

If you are using the Copy Data tool, there is an "Export as-is" checkbox for JSON; you can check that. If you are not using the Copy Data tool, just clear all the columns in the schema tab and clear all the column mappings in the mapping tab.

When I clear the schema and run the pipeline, I still get the exception: Column: DirectoryServers, Data type Newtonsoft.Json.Linq.JArray is not supported. The DirectoryServers field in the original Cosmos document looks like: DirectoryServers: [{$type: Entities.DirectoryServer, AddressId: e6a8edbb-da56-4185-94af-FAB50407256, Port: 389, Host: 192.168.122.141}]. By the way, after the run, the column appears as excluded in the source. Thank you very much! In the end, with the examples, I was able to get this working.