AWS Glue job returns an error when job bookmark is enabled


We need to convert our existing S3 source files (currently in JSON format) to Parquet. Below are the steps we followed:

1) Created a Glue job named "RawEventsToParquet" that reads data from "raw_events" (a table that currently points to JSON files in S3), converts the data to Parquet format, and stores it in a new S3 bucket, "S3/prepared/events".

2) Created a crawler, "crawparquetpreparedfiles", that crawls the above S3 bucket; it created a database, "prepared_events", and a table, "prepared_highbond_events".

3) We can query Athena and read the data from prepared_events.
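
For reference, step 2 can also be done programmatically; below is a minimal sketch using boto3, where the IAM role name and the exact S3 path are assumptions (the post only names the bucket informally):

import boto3

# Hypothetical sketch of step 2: create and start the crawler.
# The role name and S3 path are assumptions, not taken from the post above.
glue = boto3.client("glue")

glue.create_crawler(
    Name="crawparquetpreparedfiles",
    Role="AWSGlueServiceRole-Default",  # assumption: an existing Glue service role
    DatabaseName="prepared_events",     # catalog database the crawler populates
    Targets={"S3Targets": [{"Path": "s3://prepared/events/"}]},  # assumption
)
glue.start_crawler(Name="crawparquetpreparedfiles")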

We also want the job to create Parquet files only for new S3 JSON files; it should not copy and load files that have already been processed. After some research we found the option to enable the advanced job property "job bookmark". It worked fine on the first run, and we were able to query the data from the Parquet files it created. On the second run, however, it ideally should not have created any Parquet files, since we had not added any new files to the source S3 bucket; instead, the job failed with the error below. I am new to AWS, so I am finding it hard to troubleshoot this; any help is much appreciated. Thanks in advance.

AnalysisException: '\nDatasource does not support writing empty or nested empty schemas.\nPlease make sure the data schema has at least one or more column(s).\n;'
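
For what it is worth, this is the generic Spark message for writing a DataFrame whose schema has no columns. A minimal local sketch that reproduces the same AnalysisException (the output path is hypothetical) looks like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# A DataFrame with zero columns, i.e. an empty schema.
empty_df = spark.createDataFrame([], StructType([]))

# Raises: AnalysisException: Datasource does not support writing empty or nested empty schemas.
empty_df.write.parquet("/tmp/empty-parquet-demo")  # hypothetical path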

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "raw_events", table_name = "highbond_events", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
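# Note: transformation_ctx is what the job bookmark uses to track which files were already processed; a source without it is re-read on every run.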
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "raw_events", table_name = "highbond_events", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("anonymousid", "string", "anonymousid", "string"), ("channel", "string", "channel", "string"), ("context", "struct", "context", "struct"), ("event", "string", "event", "string"), ("integrations", "string", "integrations", "string"), ("messageid", "string", "messageid", "string"), ("originaltimestamp", "string", "originaltimestamp", "string"), ("projectid", "string", "projectid", "string"), ("properties", "struct", "properties", "struct"), ("receivedat", "string", "receivedat", "string"), ("sentat", "string", "sentat", "string"), ("timestamp", "string", "timestamp", "string"), ("type", "string", "type", "string"), ("userid", "string", "userid", "string"), ("version", "int", "version", "int"), ("writekey", "string", "writekey", "string"), ("_metadata", "struct", "_metadata", "struct"), ("category", "string", "category", "string"), ("name", "string", "name", "string"), ("traits", "struct", "traits", "struct"), ("groupid", "string", "groupid", "string"), ("year", "string", "year", "string"), ("month", "string", "month", "string"), ("day", "string", "day", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("anonymousid", "string", "anonymousid", "string"), ("channel", "string", "channel", "string"), ("context", "struct", "context", "struct"), ("event", "string", "event", "string"), ("integrations", "string", "integrations", "string"), ("messageid", "string", "messageid", "string"), ("originaltimestamp", "string", "originaltimestamp", "string"), ("projectid", "string", "projectid", "string"), ("properties", "struct", "properties", "struct"), ("receivedat", "string", "receivedat", "string"), ("sentat", "string", "sentat", "string"), ("timestamp", "string", "timestamp", "string"), ("type", "string", "type", "string"), ("userid", "string", "userid", "string"), ("version", "int", "version", "int"), ("writekey", "string", "writekey", "string"), ("_metadata", "struct", "_metadata", "struct"), ("category", "string", "category", "string"), ("name", "string", "name", "string"), ("traits", "struct", "traits", "struct"), ("groupid", "string", "groupid", "string"), ("year", "string", "year", "string"), ("month", "string", "month", "string"), ("day", "string", "day", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://acl-playground-grc-cleansed-data/prepared/events"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://acl-playground-grc-cleansed-data/prepared/events"}, format = "parquet", transformation_ctx = "datasink4")
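# job.commit() persists the bookmark state for this run; if it is skipped, the next run reprocesses the same files.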
job.commit()
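
Update: following the comments below, one way to avoid the failure might be to skip the write when the bookmark leaves nothing new to process. This is only a sketch of that idea, assuming the empty dynamic frame is indeed the cause; the guard is not part of the generated script above:

# Hypothetical guard: only write when the bookmarked source produced records.
if dropnullfields3.count() > 0:
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame = dropnullfields3,
        connection_type = "s3",
        connection_options = {"path": "s3://acl-playground-grc-cleansed-data/prepared/events"},
        format = "parquet",
        transformation_ctx = "datasink4")
job.commit()  # commit either way so the bookmark state is saved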

Comments:

"Can you confirm whether there is any data in the dynamic frame when you run the job the second time? I suspect that with bookmarks enabled there may be no new data. Also, how are you writing the data?"

"First of all, thank you for your reply and help. To answer your comment: no new files were loaded into the source S3 bucket before the next run, so yes, your assumption is correct. I am not sure I fully understand your question about how I am writing the data; I assume you are asking about the target. The target is an S3 bucket where the files are saved in Parquet format, whereas the source is also an S3 bucket, but it contains JSON files."

"Could you update your post with the script you used?"

"Sure, will do; I have posted the script above."