Convert CSV to JSON using PySpark with a filter and additional columns
Given a CSV input file with a header row:
"CorrelationID", "Message", "EventTimeStamp", "Flag", "RandomColumns"
12345, "Hello", "2019-06-09 04:25:15", "True", "blah"
12345, "Hello", "2019-06-09 04:25:18", "False", "blah"
45678, "Brick", "2019-06-09 04:26:23", "True", "blah"
78912, "Stone", "2019-06-09 04:29:50", "False", "blah"
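Consider only those CorrelationIDs that have both a True and a False flag. Ignore the remaining rows, whose CorrelationID does not have both the "True" and the "False" value in the "Flag" column.
The EventTimeStamp of the row with the True flag becomes StartTime, and the EventTimeStamp of the row with the False flag becomes EndTime.
Expected output, in JSON format: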
{"CorrelationID": "12345","Message":"Hello","StartTime":"2019-06-09 04:25:15","EndTime":"2019-06-09 04:25:18"}
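For reference, the rule described above can be sketched in plain Python, without Spark, which is handy for checking the expected result on a small sample. This is only an illustration: the names SAMPLE and pair_events are made up for this sketch, and it assumes each CorrelationID has at most one True row and one False row.

```python
import csv
import io
import json

# The sample input from the question, kept verbatim (note the blank after each comma).
SAMPLE = '''"CorrelationID", "Message", "EventTimeStamp", "Flag", "RandomColumns"
12345, "Hello", "2019-06-09 04:25:15", "True", "blah"
12345, "Hello", "2019-06-09 04:25:18", "False", "blah"
45678, "Brick", "2019-06-09 04:26:23", "True", "blah"
78912, "Stone", "2019-06-09 04:29:50", "False", "blah"
'''


def pair_events(csv_text):
    """Return one JSON string per CorrelationID that has both a True and a False row."""
    # skipinitialspace=True drops the blank after each comma so the quotes are honoured.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    groups = {}
    for row in reader:
        entry = groups.setdefault(
            row["CorrelationID"], {"Message": row["Message"], "times": {}}
        )
        # Remember the timestamp seen for each flag value ("True" or "False").
        entry["times"][row["Flag"]] = row["EventTimeStamp"]

    records = []
    for correlation_id, entry in groups.items():
        times = entry["times"]
        # Keep only the IDs for which both flag values were seen.
        if "True" in times and "False" in times:
            records.append(json.dumps({
                "CorrelationID": correlation_id,
                "Message": entry["Message"],
                "StartTime": times["True"],
                "EndTime": times["False"],
            }))
    return records


for line in pair_events(SAMPLE):
    print(line)
```

With the sample above, only CorrelationID 12345 survives the filter, because 45678 has no False row and 78912 has no True row.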
Use groupBy, and inside agg use the collect_set, first and last functions. See the code below.
from pyspark.sql import functions as F
(
    df
    # Cast EventTimeStamp to a timestamp so that it sorts chronologically
    .withColumn("EventTimeStamp", F.col("EventTimeStamp").cast("timestamp"))
    # Sort EventTimeStamp in ascending order
    .orderBy(F.col("EventTimeStamp").asc())
    # Group records by CorrelationID (Message is kept so it appears in the output)
    .groupBy("CorrelationID", "Message")
    .agg(
        # First value of EventTimeStamp as StartTime
        F.first("EventTimeStamp").cast("string").alias("StartTime"),
        # Last value of EventTimeStamp as EndTime
        F.last("EventTimeStamp").cast("string").alias("EndTime"),
        # Set of distinct flags; its size is used in the filter below
        F.collect_set("Flag").alias("Flag"),
    )
    # Keep only the CorrelationIDs that have both True and False
    .filter(F.size(F.col("Flag")) == 2)
    .select(
        # Wrap the required columns in a struct to build the JSON record
        F.to_json(
            F.struct("CorrelationID", "Message", "StartTime", "EndTime")
        ).alias("json_data")
    )
    .show(truncate=False)
)
+-------------------------------------------------------------------------------------------------------------+
|json_data |
+-------------------------------------------------------------------------------------------------------------+
|{"CorrelationID":"12345","Message":"Hello","StartTime":"2019-06-09 04:25:15","EndTime":"2019-06-09 04:25:18"}|
+-------------------------------------------------------------------------------------------------------------+