PySpark: exclusions doesn't work in an AWS Glue ETL job S3 connection

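According to the AWS Glue documentation, when the connection type is s3, we can use exclusions to exclude files:

"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.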

My S3 bucket looks like this, and I want to exclude the test1 folder:

/mykkkkkk-test
   test1/
      testfolder/
         11.json
         22.json
   test2/
      1.json
   test3/
      2.json
   test4/
      3.json
   test5/
      4.json

I use the code below to exclude the test1 folder, but it doesn't work: the job still processes the files under test1.

datasource0 = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://mykkkkkk-test/"],
    'exclusions': "[\"test1/**\"]",
    'recurse':True,
    'groupFiles': 'inPartition',
    'groupSize': '1048576'}, 
    format="json",
    transformation_ctx = "datasource0")

Does exclusions really work in an ETL PySpark script? I also tried the following variants, but none of them worked:

'exclusions': "[\"test1/**\"]",
'exclusions': ["test1/**"],
'exclusions': "[\"test1\"]",

Try using the full path in the exclusion pattern:

datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": ['s3://bucket/sample_data/'],
        "recurse": True,
        "exclusions": "[\"s3://bucket/sample_data/temp/**\"]"
    },
    "json",
    transformation_ctx="datasource0")
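Applied to the bucket from the question, the same idea can be sketched as follows. This is only an illustration: build_exclusions is a hypothetical helper, and the local check uses Python's fnmatch as a rough stand-in for Glue's glob matching, on the assumption (implied by the answer, not confirmed here) that patterns are compared against the full object URI.

```python
import json
from fnmatch import fnmatchcase


def build_exclusions(base_path, relative_patterns):
    """Return the JSON-list string expected by the 'exclusions' option.

    Hypothetical helper: prefixes each relative glob with the full S3 path.
    """
    base = base_path.rstrip("/")
    return json.dumps([f"{base}/{p.lstrip('/')}" for p in relative_patterns])


# Bucket name and folder taken from the question.
exclusions = build_exclusions("s3://mykkkkkk-test/", ["test1/**"])
print(exclusions)

connection_options = {
    "paths": ["s3://mykkkkkk-test/"],
    "recurse": True,
    "exclusions": exclusions,  # a JSON string, not a Python list
}
# Inside a Glue job this dict would be passed as:
# glueContext.create_dynamic_frame.from_options("s3", connection_options, "json")

# Rough local sanity check (fnmatch is NOT Glue's glob engine).
keys = [
    "s3://mykkkkkk-test/test1/testfolder/11.json",
    "s3://mykkkkkk-test/test1/testfolder/22.json",
    "s3://mykkkkkk-test/test2/1.json",
]
patterns = json.loads(exclusions)
kept = [k for k in keys if not any(fnmatchcase(k, p) for p in patterns)]
print(kept)
```

Under that assumption, a relative pattern such as test1/** never matches a key that begins with s3://, which would be consistent with the behaviour described in the question.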
