Reading partitioned Avro files in AWS Glue


For example, I have a bucket that holds a large amount of Avro data, partitioned in "hive" style:

s3://my-bucket/year=2018/month=03/day=25/file-name.avro

I am trying to access this data in Glue as follows:

val predicate = "year=2018 and month=03"
val opts = JsonOptions("""{ "paths": ["s3://my-bucket/"], "recurse": true }""")
val src = glueContext.getSource(connectionType = "s3"
                               , connectionOptions = opts
                               , pushDownPredicate = predicate
                               ).withFormat("avro")

But this expression fails with an exception:

com.amazonaws.services.glue.util.NonFatalException: User's pushdown predicate: year=2018 and month=03 can not be resolved against partition columns: []

I also tried something like this:

val predicate = "year=2018 and month=3"
val opts = JsonOptions("""{ "paths": ["s3://my-bucket/"], "recurse": true }""")
val src = glueContext.getSourceWithFormat(connectionType = "s3", format="avro", options = opts, pushDownPredicate = predicate)

But it does not accept a pushdown predicate at all:

error: unknown parameter name: pushDownPredicate

I also tried adding "partitionKeys": ["year", "month", "day"] to the JsonOptions, also without success.

How can I read hive-partitioned, Avro-serialized data in Glue without a crawler?

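Currently you cannot use a pushdown predicate with getSource() or getSourceWithFormat(), because Glue internally validates that the fields in the expression really are partition columns. With getCatalogSource(), the partition information is loaded from the Glue catalog and passed to the validator. With getSource() and getSourceWithFormat() there is no way to pass a custom list of data partitions for validation, so a pushdown predicate cannot be used.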

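As a workaround, you can generate the paths that contain the data partitions yourself and pass them through options to getSourceWithFormat(). For example, to load the data for year=2018 and (month=03 or month=04), the code should look like this: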

val paths = Array(
    "s3://bucket/data/year=2018/month=03",
    "s3://bucket/data/year=2018/month=04"
)

// Pass only the partition prefixes to load; no pushdown predicate is involved.
val source = glueContext.getSourceWithFormat(
  connectionType = "s3",
  format = "avro",
  options = JsonOptions(Map(
    "paths" -> paths,
    "recurse" -> true
  ))
).getDynamicFrame()
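
When more than a couple of partitions are needed, the path list can be generated instead of typed by hand. The following is only a minimal sketch: PartitionPaths and monthPaths are hypothetical names, not part of the Glue API, and the sketch assumes the year=YYYY/month=MM layout shown in the question, with two-digit months.

```scala
// Hypothetical helper (not part of the Glue API): builds one S3 prefix per
// requested month, following the hive-style layout year=YYYY/month=MM.
object PartitionPaths {
  def monthPaths(base: String, year: Int, months: Seq[Int]): Seq[String] = {
    // Drop a trailing slash so the result never contains "//" after the base.
    val root = base.stripSuffix("/")
    // Zero-pad months to two digits to match directories such as month=03.
    months.map(m => f"$root/year=$year%d/month=$m%02d")
  }
}
```

Calling PartitionPaths.monthPaths("s3://bucket/data", 2018, Seq(3, 4)).toArray would give the same two prefixes as the hand-written paths array, which can then be passed under the "paths" key.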

Note that the source DynamicFrame does not contain the partition columns year and month, so you may need to add them manually.
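
One way to recover the values to add is to parse them back out of each path. This is a sketch under assumptions: PartitionValues and fromPath are hypothetical names, and it only handles path segments of the plain key=value form used in this thread.

```scala
// Hypothetical helper (not part of the Glue API): extracts hive-style
// key=value segments from an S3 path into a Map.
object PartitionValues {
  // A segment qualifies only if the whole segment has the form key=value.
  private val Segment = "([^/=]+)=([^/=]+)".r

  def fromPath(path: String): Map[String, String] =
    path.split("/").collect { case Segment(k, v) => k -> v }.toMap
}
```

The resulting map could then be attached as constant columns to the data loaded from that one path. Doing so assumes each partition path is loaded separately, since a frame built from several paths at once no longer records which row came from which partition.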

The best option is to run a crawler on my_bucket and then use

glueContext.create_dynamic_frame.from_catalog(database = "my_S3_data_set", table_name = "catalog_data_table", push_down_predicate = predicate)

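Can you share your Glue catalog table schema? Does it define the partition columns correctly? You can create the table manually in the Glue catalog and specify the serde to use. You can also use Athena; in the Athena query you need to specify the partition columns.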