MySQL: AWS Glue crawler creates empty tables in the data lake

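I have successfully ingested data from a MySQL RDS database into an S3 bucket using a Lake Formation blueprint.

After inspecting the data, about 41 of the 60 tables were ingested correctly.

Debugging revealed two things: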

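  • My blueprint workflow does not ingest all tables because of this error in the blueprint/workflow:
    An error occurred while calling o319.pyWriteDynamicFrame. Unknown type '245 in column 9 of 14 in binary-encoded result set

  • The missing tables are created, but they contain no data. Judging from their JSON table properties, they are created by the initial crawl.

I already understand the error from point 1.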
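For anyone trying to reproduce the first error: as far as I can tell, 245 is the MySQL protocol's field-type code for JSON columns, so the failing tables are probably the ones that contain a JSON column. Below is a minimal sketch for checking that assumption. The query uses the standard `information_schema.columns` view; the helper name `group_json_columns` and the sample row (including the column name `payload`) are hypothetical, and running the query against a real database with a DB-API driver is left out.

```python
from collections import defaultdict

# Lists JSON-typed columns in one schema. %s is the DB-API placeholder
# style used by common MySQL drivers; pass the schema name as the parameter.
JSON_COLUMNS_SQL = """
SELECT table_name, column_name, ordinal_position
FROM information_schema.columns
WHERE table_schema = %s AND data_type = 'json'
ORDER BY table_name, ordinal_position
"""


def group_json_columns(rows):
    """Group (table_name, column_name, ordinal_position) rows by table."""
    by_table = defaultdict(list)
    for table_name, column_name, ordinal_position in rows:
        by_table[table_name].append((ordinal_position, column_name))
    return dict(by_table)


if __name__ == "__main__":
    # Hypothetical rows standing in for cursor.fetchall() output.
    sample_rows = [("bad_table", "payload", 9)]
    for table, columns in group_json_columns(sample_rows).items():
        print(table, columns)
```

If the tables listed by this query match the roughly 19 tables that failed to ingest, that would support the JSON-column explanation.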

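Has anyone run into this problem before? I have no experience editing the JDBC driver on Glue, and the documentation is consistently poor.

Am I missing an obvious fix?

Here are the JSON table properties of a table that was ingested successfully (successful_table):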

    {
         "Name": "rds_DB_successful_table",
         "DatabaseName": "rds-ingestion",
         "CreateTime": "2020-06-23T14:07:04.000Z",
         "UpdateTime": "2020-06-23T14:07:20.000Z",
         "Retention": 0,
         "StorageDescriptor": {
              "Columns": [
                   {
                        "Name": "updated_at",
                        "Type": "timestamp"
                   },
                   {
                        "Name": "name",
                        "Type": "string"
                   },
                   {
                        "Name": "created_at",
                        "Type": "timestamp"
                   },
                   {
                        "Name": "id",
                        "Type": "int"
                   }
              ],
              "Location": "s3://XXX-data-lake/DB/rds_DB_successful_tableversion_0/",
              "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
              "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
              "Compressed": false,
              "NumberOfBuckets": 0,
              "SerdeInfo": {
                   "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                   "Parameters": {
                        "serialization.format": "1"
                   }
              },
              "SortColumns": [],
              "StoredAsSubDirectories": false
         },
         "TableType": "EXTERNAL_TABLE",
         "Parameters": {
              "CreatedByJob": "RDSCONNECTOR_etl_4_b968999a",
              "CreatedByJobRun": "jr_37cc04c6fd928b9ff7a77fd50d6a98397a30c08ce3d56fae3fd618594585daea",
              "LastTransformCompletedOn": "2020-06-23 14:07:20.508091",
              "LastUpdatedByJob": "RDSCONNECTOR_etl_4_b968999a",
              "LastUpdatedByJobRun": "jr_37cc04c6fd928b9ff7a77fd50d6a98397a30c08ce3d56fae3fd618594585daea",
              "SourceConnection": "RDS Connection Type",
              "SourceTableName": "DB_successful_table",
              "SourceType": "JDBC",
              "TableVersion": "0",
              "TransformTime": "0:00:15.347357",
              "classification": "PARQUET"
         },
         "IsRegisteredWithLakeFormation": true
    }
    
    
Here are the JSON table properties of a table that was created but not ingested successfully (bad_table):

    {
         "Name": "_rds_DB_bad_table",
         "DatabaseName": "rds-ingestion",
         "Owner": "owner",
         "CreateTime": "2020-06-23T13:44:19.000Z",
         "UpdateTime": "2020-06-23T13:44:19.000Z",
         "LastAccessTime": "2020-06-23T13:44:19.000Z",
         "Retention": 0,
         "StorageDescriptor": {
              "Columns": [
                   {
                        "Name": "office_id",
                        "Type": "int"
                   },
                   {
                        "Name": "updated_at",
                        "Type": "timestamp"
                   },
                   {
                        "Name": "created_at",
                        "Type": "timestamp"
                   },
                   {
                        "Name": "id",
                        "Type": "int"
                   },
                   {
                        "Name": "position",
                        "Type": "int"
                   },
                   {
                        "Name": "id",
                        "Type": "int"
                   },
                   {
                        "Name": "deadline",
                        "Type": "date"
                   }
              ],
              "Location": "DB.bad_table",
              "Compressed": false,
              "NumberOfBuckets": -1,
              "SerdeInfo": {
                   "Parameters": {}
              },
              "BucketColumns": [],
              "SortColumns": [],
              "Parameters": {
                   "CrawlerSchemaDeserializerVersion": "1.0",
                   "CrawlerSchemaSerializerVersion": "1.0",
                   "UPDATED_BY_CRAWLER": "RDSCONNECTOR_discoverer_57904714",
                   "classification": "mysql",
                   "compressionType": "none",
                   "connectionName": "RDS Connection Type",
                   "typeOfData": "table"
              },
              "StoredAsSubDirectories": false
         },
         "PartitionKeys": [],
         "TableType": "EXTERNAL_TABLE",
         "Parameters": {
              "CrawlerSchemaDeserializerVersion": "1.0",
              "CrawlerSchemaSerializerVersion": "1.0",
              "UPDATED_BY_CRAWLER": "RDSCONNECTOR_discoverer_57904714",
              "classification": "mysql",
              "compressionType": "none",
              "connectionName": "RDS Connection Type",
              "typeOfData": "table"
         },
         "CreatedBy": "arn:aws:sts::724135113484:assumed-role/LakeFormationWorkflowRole/AWS-Crawler",
         "IsRegisteredWithLakeFormation": false
    }
    
Perhaps comparing the JSON table properties of the successful and unsuccessful tables is the key.
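To make that comparison less error-prone than reading the two listings side by side, here is a rough sketch that flattens two table-property dictionaries and reports only the keys that differ. The function names `flatten` and `diff_tables` and the `MISSING` marker are my own, and the sample dictionaries are trimmed excerpts of the two listings above, not the full properties.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value}; lists stay as leaf values."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_tables(good, bad):
    """Return {path: (good_value, bad_value)} for every differing key."""
    g, b = flatten(good), flatten(bad)
    return {
        path: (g.get(path, MISSING), b.get(path, MISSING))
        for path in sorted(g.keys() | b.keys())
        if g.get(path, MISSING) != b.get(path, MISSING)
    }


if __name__ == "__main__":
    # Trimmed excerpts of the two table definitions shown above.
    good = {
        "StorageDescriptor": {
            "Location": "s3://XXX-data-lake/DB/rds_DB_successful_tableversion_0/"
        },
        "Parameters": {"classification": "PARQUET", "SourceType": "JDBC"},
        "IsRegisteredWithLakeFormation": True,
    }
    bad = {
        "StorageDescriptor": {"Location": "DB.bad_table"},
        "Parameters": {"classification": "mysql"},
        "IsRegisteredWithLakeFormation": False,
    }
    for path, (good_value, bad_value) in diff_tables(good, bad).items():
        print(f"{path}: {good_value!r} != {bad_value!r}")
```

On the excerpts above, the differences that stand out are the `Location` (an S3 path versus the JDBC-style `DB.bad_table`) and the `classification` (`PARQUET` versus `mysql`). That seems consistent with the bad table being only the crawler's catalog entry for the source table, with no ETL output written behind it.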

Any help would be greatly appreciated.