Apache Spark: Why does Spark load unnecessary data when reading compressed JSON files?

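I am using PySpark to process log files from S3 and filter them by date. The logs are gzip-compressed JSON files laid out by year/month/day, like this:
s3://bucket/logs/YYYY/MM/DD/.json.gz

Because this layout does not follow the HDFS partition syntax (year=YYYY/month=MM/day=DD), I read the whole folder and build a date column from input_file_name and a regex: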

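from pyspark.sql.functions import col, input_file_name, regexp_extract, regexp_replace

df = spark.read.option("compression", "gzip").text("s3a://bucket/logs/*/*/*/*/*/*.json.gz")
# Record which file each row came from
df = df.withColumn("path_file", input_file_name())
# Pull YYYY/MM/DD out of the file path
df = df.withColumn("logstash_date", regexp_extract(col("path_file"), r"(?:s3a:\/\/bucket\/logs\/)(\d{04}\/\d{02}\/\d{02})", 1))
# Turn YYYY/MM/DD into YYYY-MM-DD and cast it to a date
df = df.withColumn("logstash_date", regexp_replace(col("logstash_date"), "/", "-").cast("date"))
df = df.filter(col("logstash_date") >= from_date.date())
# Later: parse the schema with from_json, apply more filters and perform a join (to eliminate duplicate logs)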
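The date column above depends only on the file path, never on the file contents. As a quick sanity check outside Spark, the same extraction can be sketched in plain Python. The helper name logstash_date_from_path and the sample path are hypothetical, and the regex is a simplified equivalent of the one in the question, not a copy of it.

```python
import re
from datetime import date

# Plain-Python version of the regexp_extract / regexp_replace / cast steps.
# The bucket name and sample file name are placeholders.
DATE_IN_PATH = re.compile(r"s3a://bucket/logs/(\d{4})/(\d{2})/(\d{2})/")


def logstash_date_from_path(path):
    """Return the date encoded in a log file path, or None if there is none."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


sample = "s3a://bucket/logs/2020/05/12/part-0001.json.gz"
print(logstash_date_from_path(sample))
```

Since the path alone decides whether a file passes the date filter, the set of files to read can in principle be chosen before any data is loaded.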
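If the logs were partitioned using the HDFS syntax, Spark would be able to filter them without reading the actual data.
However, Spark still seems to read the data for the filtered-out dates, even though I never use it.
Here is the information shown in the UI: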

The logical plan looks perfect: the date filter is applied before
Project [value#0 AS raw_data#15, path_file#2].

== Parsed Logical Plan ==
'InsertIntoHadoopFsRelationCommand s3a://<REDACTED>, false, ['event_day], Parquet, Map(basePath -> s3a://<REDACTED>, path -> s3a://<REDACTED>), Append, [id, client_id, somos_id, user_id, session_id, user_agent, method, status, path, timestamp, message, controller, action, facility, raw_data, event_day, generated_at]
+- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15, event_day#81, 2020-05-12T05:31:32.820718+00:00 AS generated_at#149]
   +- Repartition 10, false
      +- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15, event_day#81]
         +- Filter isnull(id#98)
            +- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15, event_day#81, id#98]
               +- Join LeftOuter, (id#22 = id#98)
                  :- Filter (((((facility#35 = rails-production) && NOT (controller#33 = PingController)) && isnotnull(path#30)) && (isnotnull(user_id#25) && isnotnull(timestamp#65))) && (((isnotnull(method#28) && NOT (method#28 = HEAD)) && NOT client_id#23 LIKE converge_%) && (NOT controller#33 LIKE Hotsite::% && NOT message#32 LIKE somos_id%)))
                  :  +- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15, to_date('timestamp, None) AS event_day#81]
                  :     +- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, to_timestamp('timestamp, None) AS timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15]
                  :        +- Project [data#18.id AS id#22, data#18.client_id AS client_id#23, data#18.somos_id AS somos_id#24, data#18.user_id AS user_id#25, data#18.session_id AS session_id#26, data#18.user_agent AS user_agent#27, data#18.method AS method#28, data#18.status AS status#29, data#18.path AS path#30, data#18.timestamp AS timestamp#31, data#18.message AS message#32, data#18.controller AS controller#33, data#18.action AS action#34, data#18.facility AS facility#35, raw_data#15]
                  :           +- Project [raw_data#15, path_file#2, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), raw_data#15, Some(GMT)) AS data#18]
                  :              +- Project [value#0 AS raw_data#15, path_file#2]
                  :                 +- Project [value#0, path_file#2]
                  :                    +- Filter (logstash_date#9 >= 18392)
                  :                       +- Project [value#0, path_file#2, cast(regexp_replace(logstash_date#5, /, -) as date) AS logstash_date#9]
                  :                          +- Project [value#0, path_file#2, regexp_extract(path_file#2, (?:s3a:\/\/bucket\/logs\/)(\d{04}\/\d{02}\/\d{02}), 1) AS logstash_date#5]
                  :                             +- Project [value#0, input_file_name() AS path_file#2]
                  :                                +- Relation[value#0] text
                  +- Project [id#98]
                     +- Filter (event_day#114 >= 18392)
                        +- Relation[id#98,client_id#99,somos_id#100,user_id#101,session_id#102,user_agent#103,method#104,status#105,path#106,timestamp#107,message#108,controller#109,action#110,facility#111,raw_data#112,generated_at#113,event_day#114] parquet
But in the optimized logical plan, the two steps are merged into a single Filter:

== Optimized Logical Plan ==
InsertIntoHadoopFsRelationCommand s3a://<REDACTED>, false, [event_day#81], Parquet, Map(basePath -> s3a://<REDACTED>, path -> s3a://<REDACTED>), Append, [id, client_id, somos_id, user_id, session_id, user_agent, method, status, path, timestamp, message, controller, action, facility, raw_data, event_day, generated_at]
+- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15, event_day#81, 2020-05-12T05:31:32.820718+00:00 AS generated_at#149]
   +- Repartition 10, false
      +- Project [id#22, client_id#23, somos_id#24, user_id#25, session_id#26, user_agent#27, method#28, status#29, path#30, timestamp#65, message#32, controller#33, action#34, facility#35, raw_data#15, event_day#81]
         +- Filter isnull(id#98)
            +- Join LeftOuter, (id#22 = id#98)
               :- Project [jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).id AS id#22, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).client_id AS client_id#23, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).somos_id AS somos_id#24, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), 
StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).user_id AS user_id#25, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).session_id AS session_id#26, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).user_agent AS user_agent#27, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), 
StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).method AS method#28, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).status AS status#29, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).path AS path#30, cast(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).timestamp 
as timestamp) AS timestamp#65, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).message AS message#32, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).controller AS controller#33, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).action AS action#34, jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), 
StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).facility AS facility#35, value#0 AS raw_data#15, cast(cast(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).timestamp as timestamp) as date) AS event_day#81]
               :  +- Filter (((((((((((cast(regexp_replace(regexp_extract(path_file#2, (?:s3a:\/\/bucket\/logs\/)(\d{04}\/\d{02}\/\d{02}), 1), /, -) as date) >= 18392) && (jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).facility = rails-production)) && NOT (jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).controller = PingController)) && isnotnull(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).path)) && 
isnotnull(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).user_id)) && isnotnull(cast(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).timestamp as timestamp))) && isnotnull(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).method)) && NOT (jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), 
StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).method = HEAD)) && NOT jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).client_id LIKE converge_%) && NOT StartsWith(jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).controller, Hotsite::)) && NOT jsontostructs(StructField(id,StringType,true), StructField(client_id,StringType,true), StructField(somos_id,StringType,true), StructField(user_id,StringType,true), StructField(session_id,StringType,true), StructField(user_agent,StringType,true), StructField(method,StringType,true), StructField(status,StringType,true), StructField(path,StringType,true), 
StructField(timestamp,StringType,true), StructField(message,StringType,true), StructField(controller,StringType,true), StructField(action,StringType,true), StructField(facility,StringType,true), value#0, Some(GMT)).message LIKE somos_id%)
               :     +- Project [value#0, input_file_name() AS path_file#2]
               :        +- Relation[value#0] text
               +- Project [id#98]
                  +- Filter ((isnotnull(event_day#114) && (event_day#114 >= 18392)) && isnotnull(id#98))
                     +- Relation[id#98,client_id#99,somos_id#100,user_id#101,session_id#102,user_agent#103,method#104,status#105,path#106,timestamp#107,message#108,controller#109,action#110,facility#111,raw_data#112,generated_at#113,event_day#114]
spark_session.read.option("basePath",base_path).json(list_of_file_paths)
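
The reader call above takes an explicit list of paths instead of a catch-all glob. One way to build such a list is to generate one glob per day in the wanted date range, so that files from earlier days are never opened. This is a minimal sketch: the helper name daily_path_globs, the bucket name, and the date range are assumptions for illustration, and the Spark call is left as a comment because it needs a live session.

```python
from datetime import date, timedelta


def daily_path_globs(base_path, from_date, to_date):
    """Build one glob per day in [from_date, to_date], both ends inclusive."""
    globs = []
    current = from_date
    while current <= to_date:
        # Matches the YYYY/MM/DD layout described in the question
        globs.append(f"{base_path}/{current:%Y/%m/%d}/*.json.gz")
        current += timedelta(days=1)
    return globs


base_path = "s3a://bucket/logs"  # placeholder bucket
list_of_file_paths = daily_path_globs(base_path, date(2020, 5, 10), date(2020, 5, 12))
print(list_of_file_paths)

# With a running session, the list can then be passed to the reader:
# df = spark_session.read.option("basePath", base_path).json(list_of_file_paths)
```

Note that Spark may raise an error for a glob that matches no files, so days without any logs might have to be dropped from the list first.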