Amazon Web Services: How to upgrade a Data Pipeline definition from EMR 3.x to 4.x/5.x?
I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x so that I can take advantage of the latest Hive features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP.

Moving from EMR 3.x to 4.x/5.x means using releaseLabel in EmrCluster instead of amiVersion. When I use "releaseLabel": "emr-4.1.0", I get the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

Below is my Data Pipeline definition for EMR 3.x. It works well, so I hope others find it useful (together with an answer for EMR 4.x/5.x): the common recommendation for importing data from a file into DynamoDB is to use Data Pipeline, but nobody has really put forward a solid, simple working example (say, for a custom data format).
{
  "objects": [
    {
      "type": "DynamoDBDataNode",
      "id": "DynamoDBDataNode1",
      "name": "OutputDynamoDBTable",
      "dataFormat": {
        "ref": "DynamoDBDataFormat1"
      },
      "region": "us-east-1",
      "tableName": "testImport"
    },
    {
      "type": "Custom",
      "id": "Custom1",
      "name": "InputCustomFormat",
      "column": [
        "firstName", "lastName"
      ],
      "columnSeparator" : "|",
      "recordSeparator" : "\n"
    },
    {
      "type": "S3DataNode",
      "id": "S3DataNode1",
      "name": "InputS3Data",
      "directoryPath": "s3://data.domain.com",
      "dataFormat": {
        "ref": "Custom1"
      }
    },
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "ondemand",
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://logs.data.domain.com"
    },
    {
      "type": "HiveActivity",
      "id": "HiveActivity1",
      "name": "S3ToDynamoDBImportActivity",
      "output": {
        "ref": "DynamoDBDataNode1"
      },
      "input": {
        "ref": "S3DataNode1"
      },
      "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT reflect('java.util.UUID', 'randomUUID') as uuid, TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) as loadDate, firstName, lastName FROM ${input1};",
      "runsOn": {
        "ref": "EmrCluster1"
      }
    },
    {
      "type": "EmrCluster",
      "name": "EmrClusterForImport",
      "id": "EmrCluster1",
      "coreInstanceType": "m1.medium",
      "coreInstanceCount": "1",
      "masterInstanceType": "m1.medium",
      "amiVersion": "3.11.0",
      "region": "us-east-1",
      "terminateAfter": "1 Hours"
    },
    {
      "type": "DynamoDBDataFormat",
      "id": "DynamoDBDataFormat1",
      "name": "OutputDynamoDBDataFormat",
      "column": [
        "uuid", "loadDate", "firstName", "lastName"
      ]
    }
  ],
  "parameters": []
}
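To make the amiVersion to releaseLabel change concrete, here is a minimal Python sketch that rewrites the EmrCluster object of a definition like the one above. The helper name use_release_label is hypothetical, and the sketch only swaps the version field: it is an assumption that nothing else in the cluster object needs to change, and it does not by itself explain or fix the TezTask error.

```python
import copy
import json


def use_release_label(definition, release_label):
    """Return a copy of a pipeline definition whose EmrCluster objects use
    releaseLabel instead of amiVersion. All other fields are left untouched."""
    upgraded = copy.deepcopy(definition)
    for obj in upgraded.get("objects", []):
        if obj.get("type") == "EmrCluster":
            obj.pop("amiVersion", None)          # EMR 3.x style version field
            obj["releaseLabel"] = release_label  # EMR 4.x/5.x style version field
    return upgraded


# Trimmed-down stand-in for the definition above (cluster object only).
definition = {
    "objects": [
        {
            "type": "EmrCluster",
            "name": "EmrClusterForImport",
            "id": "EmrCluster1",
            "coreInstanceType": "m1.medium",
            "coreInstanceCount": "1",
            "masterInstanceType": "m1.medium",
            "amiVersion": "3.11.0",
            "region": "us-east-1",
            "terminateAfter": "1 Hours",
        }
    ],
    "parameters": [],
}

upgraded = use_release_label(definition, "emr-4.1.0")
print(json.dumps(upgraded, indent=2))
```

The function works on a deep copy, so the original 3.x definition stays available for comparison or rollback.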
A sample input file might look like this:
John|Doe
Jane|Doe
Carl|Doe
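For clarity, this is how the Custom format in the definition (columnSeparator "|", recordSeparator "\n", columns firstName and lastName) is meant to map onto such a file. The Python below is only an illustration of that intent: parse_records is a hypothetical helper, not what Data Pipeline or Hive actually runs.

```python
def parse_records(text, columns, column_separator="|", record_separator="\n"):
    """Split raw text into records, then each record into named columns."""
    rows = []
    for record in text.split(record_separator):
        if not record:
            continue  # skip the empty piece after a trailing separator
        values = record.split(column_separator)
        rows.append(dict(zip(columns, values)))
    return rows


sample = "John|Doe\nJane|Doe\nCarl|Doe\n"
rows = parse_records(sample, ["firstName", "lastName"])
for row in rows:
    print(row)
```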
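Bonus: instead of setting the current date in a column, how can I set it as a variable in the hiveScript section? I tried SET loadDate=CURRENT_DATE\n\n INSERT OVERWRITE… without success. My example does not show the other dynamic fields that I would like to set before the query clause.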
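One possible workaround, which swaps in a different technique from Hive variables, is to compute such values before the definition is uploaded and embed them in the hiveScript string as literals. The sketch below assumes the definition is generated by a script; build_hive_script is a hypothetical helper. Note the limitation: the date is fixed when the definition is generated, not when the pipeline runs.

```python
import datetime


def build_hive_script(load_date):
    """Build the hiveScript string with the load date embedded as a literal."""
    literal = "'" + load_date.isoformat() + "'"  # e.g. '2016-01-15'
    return (
        "INSERT OVERWRITE TABLE ${output1} "
        "SELECT reflect('java.util.UUID', 'randomUUID') as uuid, "
        + literal + " as loadDate, firstName, lastName "
        "FROM ${input1};"
    )


# ${output1} and ${input1} are left as-is for Data Pipeline to substitute.
script = build_hive_script(datetime.date(2016, 1, 15))
print(script)
```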