AWS Data Pipeline: Uploading a CSV File from S3 to DynamoDB

I'm trying to use Data Pipeline to migrate CSV data from S3 into DynamoDB. The data is not in DynamoDB export format, just plain CSV.

I know Data Pipeline is normally used to import or export data in DynamoDB format rather than standard CSV. I believe I've read that a plain file can be used, but I haven't been able to put together anything that works, and the AWS documentation isn't much help either. I also haven't found any reference posts that are reasonably recent (less than two years old).

If this is possible, can anyone offer some insight into why my pipeline might not be working? I've pasted the pipeline definition and the error message below. The error seems to point to a problem inserting the data into Dynamo, I'm guessing because the file isn't in export format.

I would do this in Lambda instead, but the data load takes longer than Lambda's 15-minute limit.

Thanks.

{
  "objects": [
    {
      "myComment": "Activity used to run the hive script to import CSV data",
      "output": {
        "ref": "dynamoDataTable"
      },
      "input": {
        "ref": "s3csv"
      },
      "name": "S3toDynamoLoader",
      "hiveScript": "DROP TABLE IF EXISTS tempHiveTable;\n\nDROP TABLE IF EXISTS s3TempTable;\n\nCREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef}) \nSTORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' \nTBLPROPERTIES (\"dynamodb.table.name\" = \"#{myDDBTableName}\", \"dynamodb.column.mapping\" = \"#{myDDBTableColMapping}\");\n                    \nCREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef}) \nROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\\n' LOCATION '#{myInputS3Loc}';\n                    \nINSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;",
      "id": "S3toDynamoLoader",
      "runsOn": { "ref": "EmrCluster" },
      "stage": "false",
      "type": "HiveActivity"
    },
    {
      "myComment": "The DynamoDB table that we are uploading to",
      "name": "DynamoDB",
      "id": "dynamoDataTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}",
      "writeThroughputPercent": "1.0",
      "dataFormat": {
        "ref": "DDBTableFormat"
      }
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myLogUri}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "EmrCluster",
      "coreInstanceType": "m1.medium",
      "coreInstanceCount": "1",
      "masterInstanceType": "m1.medium",
      "releaseLabel": "emr-5.29.0",
      "id": "EmrCluster",
      "type": "EmrCluster",
      "terminateAfter": "2 Hours"
    },
    {
      "myComment": "The S3 file that contains the data we're importing",
      "directoryPath": "#{myInputS3Loc}",
      "dataFormat": {
        "ref": "csvFormat"
      },
      "name": "S3DataNode",
      "id": "s3csv",
      "type": "S3DataNode"
    },
    {
      "myComment": "Format for the S3 Path",
      "name": "S3ExportFormat",
      "column": "not_used STRING",
      "id": "csvFormat",
      "type": "CSV"
    },
    {
      "myComment": "Format for the DynamoDB table",
      "name": "DDBTableFormat",
      "id": "DDBTableFormat",
      "column": "not_used STRING",
      "type": "DynamoDBExportDataFormat"
    }
  ],
  "parameters": [
    {
      "description": "S3 Column Mappings",
      "id": "myS3ColDef",
      "default": "phoneNumber string,firstName string,lastName string, spend double",
      "type": "String"
    },
    {
      "description": "DynamoDB Column Mappings",
      "id": "myDDBColDef",
      "default": "phoneNumber String,firstName String,lastName String, spend double",
      "type": "String"
    },
    {
      "description": "Input S3 foder",
      "id": "myInputS3Loc",
      "default": "s3://POCproject-dev1-data/upload/",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "DynamoDB table name",
      "id": "myDDBTableName",
      "default": "POCproject-pipeline-data",
      "type": "String"
    },
    {
      "description": "S3 to DynamoDB Column Mapping",
      "id": "myDDBTableColMapping",
      "default": "phoneNumber:phoneNumber,firstName:firstName,lastName:lastName,spend:spend",
      "type": "String"
    },
    {
      "description": "DataPipeline Log Uri",
      "id": "myLogUri",
      "default": "s3://POCproject-dev1-data/",
      "type": "AWS::S3::ObjectKey"
    }
  ]
}
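
For reference, here is the hiveScript embedded in the S3toDynamoLoader activity with the JSON escaping removed (the #{...} expressions are Data Pipeline parameter references resolved at run time):

DROP TABLE IF EXISTS tempHiveTable;

DROP TABLE IF EXISTS s3TempTable;

CREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef})
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "#{myDDBTableName}", "dynamodb.column.mapping" = "#{myDDBTableColMapping}");

CREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '#{myInputS3Loc}';

INSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;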


Have you tried this sample? It uses Hive to import a CSV file into a DynamoDB table.

My script is based on that sample, and I've never gotten it to run without an error. I'm not sure what's going wrong. Here's the error:
[INFO] (TaskRunnerService-df-09432511OLZUA8VN0NLE_@EmrCluster_2020-03-06T02:52:47-0) df-09432511OLZUA8VN0NLE amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
    at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:258)
    at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:215)
    at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
    at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
    ... 18 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
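
The ValidationException above is DynamoDB rejecting a write rather than a format problem per se: at least one row produces an attribute whose value is an empty string, which DynamoDB rejects here. That usually means a blank field somewhere in the CSV (for example a trailing comma or a missing value). A minimal sketch of a workaround, assuming the column names from the parameters above and assuming phoneNumber is the table's key attribute: convert empty strings to NULL in the final INSERT, since the DynamoDB storage handler should omit NULL attributes instead of writing an empty value.

-- Hypothetical replacement for the last statement of the hiveScript.
-- Empty strings become NULL so the attribute is omitted from the item;
-- rows with an empty key attribute are dropped entirely, because the
-- key itself may never be empty or missing.
INSERT OVERWRITE TABLE tempHiveTable
SELECT
  phoneNumber,
  CASE WHEN firstName = '' THEN NULL ELSE firstName END,
  CASE WHEN lastName  = '' THEN NULL ELSE lastName  END,
  spend
FROM s3TempTable
WHERE phoneNumber IS NOT NULL AND phoneNumber != '';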