Problem: dropping rows with null values in AWS Glue


Currently there is an issue with an AWS Glue job that reads an S3 collection and writes it to AWS Redshift, where one of the columns can be null.

The job should be fairly simple, and most of the code was auto-generated by the Glue interface, but because some columns that are not nullable in Redshift are sometimes empty in the dataset, the job cannot complete.

A condensed version of the code is shown below; the code is Python and the environment is PySpark:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.transforms import ApplyMapping, DropNullFields
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ['TempDir'])
glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source collection from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_1", table_name = "table_1", transformation_ctx = "datasource0")

# Resolve ambiguous (Choice) types on the price columns
resolvedDDF = datasource0.resolveChoice(specs = [
  ('price_current','cast:double'),
  ('price_discount','cast:double'),
])

# Map the source fields onto the Redshift target schema
applymapping = ApplyMapping.apply(frame = resolvedDDF, mappings = [
  ("id", "string", "id", "string"),
  ("status", "string", "status", "string"),
  ("price_current", "double", "price_current", "double"),
  ("price_discount", "double", "price_discount", "double"),
  ("created_at", "string", "created_at", "string"),
  ("updated_at", "string", "updated_at", "string"),
], transformation_ctx = "applymapping")

# Drop rows that are missing either of the not-null columns
droppedDF = applymapping.toDF().dropna(subset=('created_at', 'price_current'))

newDynamicDF = DynamicFrame.fromDF(droppedDF, glueContext, "newframe")

# Drop columns whose values are null for every record
dropnullfields = DropNullFields.apply(frame = newDynamicDF, transformation_ctx = "dropnullfields")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields, catalog_connection = "RedshiftDataStaging", connection_options = {"dbtable": "dbtable_1", "database": "database_1"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")
We have a not-null constraint on the price_current and created_at columns in the Redshift table, and because of some earlier bugs in our system, some records arrived in the S3 bucket without the required data. We just want to drop those rows, since they make up only a small percentage of the overall data to be processed.

Despite the dropna code, we still get the following error from Redshift:

Error (code 1213) while loading data into Redshift: "Missing data for not-null field"
Table name: "PUBLIC".table_1
Column name: created_at
Column type: timestamptz(0)
Raw field value: @NULL@
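
To pin down which rows are still reaching Redshift with nulls, it can help to count them per constrained column just before the write. A minimal diagnostic sketch, assuming the applymapping frame from the job above; the loop and output formatting are illustrative, not part of the original job:

from pyspark.sql import functions as F

check_df = applymapping.toDF()
for col_name in ("created_at", "price_current"):
    # isNull() matches only true nulls; empty strings would pass this check
    null_count = check_df.filter(F.col(col_name).isNull()).count()
    print(col_name, "has", null_count, "null rows")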

You can pass default values if you don't want to drop those rows:

df = dropnullfields.toDF()

df = df.na.fill({'price_current': 0.0, 'created_at': ' '})

# fromDF takes the GlueContext itself plus a frame name, not a string
dyf = DynamicFrame.fromDF(df, glueContext, "dyf_1")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dyf, catalog_connection = "RedshiftDataStaging", connection_options = {"dbtable": "dbtable_1", "database": "database_1"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")
If you want to drop the rows instead, use the following code in place of df.na.fill:

df = df.na.drop(subset=["price_current", "created_at"])
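
Putting the drop variant together with the conversion and write steps already shown above, the tail of the job would look roughly like this (a sketch reusing the names from the answer; "dropped_frame" is an arbitrary frame name):

df = dropnullfields.toDF()

# Drop any row where either constrained column is null
df = df.na.drop(subset=["price_current", "created_at"])

dyf = DynamicFrame.fromDF(df, glueContext, "dropped_frame")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dyf, catalog_connection = "RedshiftDataStaging", connection_options = {"dbtable": "dbtable_1", "database": "database_1"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")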

The write to Redshift still fails even after applymapping.toDF().dropna(subset=('created_at','price_current')).
I do get an error, but it is at the very bottom of my question: Error (code 1213) while loading data into Redshift.

df = df.na.fill({'price_current': 0.0, 'created_at': ' '}) replaces the nulls with default values. Error code 1213 means the Redshift column has a NOT NULL constraint but you are passing it a NULL value. With the code above, you replace the NULLs with default non-null values before inserting into Redshift.
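
For reference, the behavioral difference between the two approaches on a toy DataFrame (a standalone sketch assuming a local SparkSession; not part of the Glue job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [("a", None, "2020-01-01"), ("b", 9.99, None), ("c", 1.5, "2020-01-02")]
df = spark.createDataFrame(rows, ["id", "price_current", "created_at"])

# fill keeps all three rows, replacing nulls with the given defaults
df.na.fill({"price_current": 0.0, "created_at": " "}).show()

# drop keeps only row "c", the one with no nulls in either column
df.na.drop(subset=["price_current", "created_at"]).show()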