Converting CSV to JSON with PySpark, with a filter and additional columns

Given a CSV input file with a header row:

"CorrelationID", "Message", "EventTimeStamp", "Flag", "RandomColumns"
12345, "Hello", "2019-06-09 04:25:15", "True", "blah"
12345, "Hello", "2019-06-09 04:25:18", "False", "blah"
45678, "Brick", "2019-06-09 04:26:23", "True", "blah"
78912, "Stone", "2019-06-09 04:29:50", "False", "blah"
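Only consider those CorrelationID values that have both a True and a False flag. Ignore all remaining rows, i.e. those whose CorrelationID does not have both "True" and "False" values in the "Flag" column.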

The EventTimeStamp of the row with the True flag becomes StartTime, and the EventTimeStamp of the row with the False flag becomes EndTime.

Expected output, in JSON format:

{"CorrelationID": "12345","Message":"Hello","StartTime":"2019-06-09 04:25:15","EndTime":"2019-06-09 04:25:18"}
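
To make the rule concrete before bringing in Spark, here is a minimal plain-Python sketch of the same pairing logic, run on the sample rows above. The function name pair_events and the SAMPLE constant are hypothetical, introduced only for illustration. It also assumes each CorrelationID has at most one True row and one False row, as in the sample; with duplicates, the last row seen would win.

```python
import csv
import io
import json

# Inline copy of the sample file from the question.
SAMPLE = '''"CorrelationID", "Message", "EventTimeStamp", "Flag", "RandomColumns"
12345, "Hello", "2019-06-09 04:25:15", "True", "blah"
12345, "Hello", "2019-06-09 04:25:18", "False", "blah"
45678, "Brick", "2019-06-09 04:26:23", "True", "blah"
78912, "Stone", "2019-06-09 04:29:50", "False", "blah"
'''


def pair_events(text):
    """Return one record per CorrelationID that has both a True and a False row."""
    # skipinitialspace=True lets the csv module handle the space after each comma,
    # so the quoted fields are still recognised as quoted.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    groups = {}
    for row in reader:
        key = row["CorrelationID"]
        entry = groups.setdefault(
            key, {"CorrelationID": key, "Message": row["Message"]}
        )
        flag = row["Flag"].strip().lower()
        if flag == "true":
            entry["StartTime"] = row["EventTimeStamp"]
        elif flag == "false":
            entry["EndTime"] = row["EventTimeStamp"]
    # Keep only the ids for which both timestamps were found.
    wanted = ("CorrelationID", "Message", "StartTime", "EndTime")
    return [
        {name: entry[name] for name in wanted}
        for entry in groups.values()
        if "StartTime" in entry and "EndTime" in entry
    ]


records = pair_events(SAMPLE)
for record in records:
    print(json.dumps(record))
```

On the sample data only CorrelationID 12345 survives, because 45678 has no False row and 78912 has no True row.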

Use groupBy and, inside the agg function, the collect_set, first, and last functions. See the code below.

from pyspark.sql import functions as F

# df is assumed to be the DataFrame read from the CSV above (header row, all columns as strings).
result = (
    df
    # Cast EventTimeStamp to timestamp so that it sorts chronologically.
    .withColumn("EventTimeStamp", F.col("EventTimeStamp").cast("timestamp"))
    # Sort ascending so that first/last pick the earliest/latest event.
    # Caveat: Spark does not strictly guarantee that this order survives the groupBy shuffle.
    .orderBy(F.col("EventTimeStamp").asc())
    # Group the records by CorrelationID (and Message, so it can be kept in the output).
    .groupBy(F.col("CorrelationID"), F.col("Message"))
    .agg(
        # First EventTimeStamp in the group as StartTime.
        F.first(F.col("EventTimeStamp")).cast("string").alias("StartTime"),
        # Last EventTimeStamp in the group as EndTime.
        F.last(F.col("EventTimeStamp")).cast("string").alias("EndTime"),
        # Distinct flags seen for this group; its size is used in the filter below.
        F.collect_set(F.col("Flag")).alias("Flag"),
    )
    # Keep only groups that contain both a True and a False flag.
    .filter(F.size(F.col("Flag")) == 2)
    .select(
        # Wrap the required columns in a struct and serialise it as one JSON record.
        F.to_json(
            F.struct(
                F.col("CorrelationID"),
                F.col("Message"),
                F.col("StartTime"),
                F.col("EndTime"),
            )
        ).alias("json_data")
    )
)
result.show(truncate=False)
+-------------------------------------------------------------------------------------------------------------+
|json_data                                                                                                    |
+-------------------------------------------------------------------------------------------------------------+
|{"CorrelationID":"12345","Message":"Hello","StartTime":"2019-06-09 04:25:15","EndTime":"2019-06-09 04:25:18"}|
+-------------------------------------------------------------------------------------------------------------+