Dataframe: convert columns with key-value pairs in a dataset into separate rows

I have data in a dataframe that was obtained from Azure Event Hub. I then convert this data into a JSON object and store the required fields in a dataset, as shown below:
+-----------------+--------------------+--------------------+--------------------+--------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|XXXXX01|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX02|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|
|XXXXX03|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX04|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX05|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX06|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|
|XXXXX07|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX08|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
Code used to fetch the data from Event Hub and store it in a dataframe:
val connectionString = ConnectionStringBuilder()
  .setEventHubName("").build
val currTime = Instant.now
val ehConf = EventHubsConf(connectionString)
  .setConsumerGroup("")
  .setStartingPosition(EventPosition
    .fromEnqueuedTime(currTime.minus(Duration.ofMinutes(30))))
  .setEndingPosition(EventPosition.fromEnqueuedTime(currTime))
val reader = spark.read.format("eventhubs").options(ehConf.toMap).load()
var SIGNALS = reader
  .select(get_json_object(($"body").cast("string"), "$.NUM").alias("NUM"),
    get_json_object(($"body").cast("string"), "$.SIG1").alias("SIG1"),
    get_json_object(($"body").cast("string"), "$.SIG2").alias("SIG2"),
    get_json_object(($"body").cast("string"), "$.SIG3").alias("SIG3"),
    get_json_object(($"body").cast("string"), "$.SIG4").alias("SIG4"))

val SIGNALSFiltered = SIGNALS.filter(col("SIG1").isNotNull &&
  col("SIG2").isNotNull && col("SIG3").isNotNull && col("SIG4").isNotNull)
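For reference, here is a minimal sketch of what the select above produces (a sketch only: it assumes a spark-shell session where `spark` and `$` are available, and the sample body string is made up). get_json_object extracts a JSON path from a string column and returns it as a string, or null when the path is missing.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Made-up payload standing in for the Event Hub "body" column
val raw = Seq(
  """{"NUM":"XXXXX01","SIG1":[{"TIME":1569560531000,"VALUE":3.7825}]}"""
).toDF("body")

val parsed = raw.select(
  get_json_object($"body", "$.NUM").alias("NUM"),
  get_json_object($"body", "$.SIG1").alias("SIG1"),  // returned as a JSON *string*, not an array
  get_json_object($"body", "$.SIG2").alias("SIG2"))  // missing path => null, dropped by the filter

parsed.show(false)
```

Note that SIG1 comes back as a plain string here; this detail matters later, because explode requires an array column.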
The data obtained in SIGNALSFiltered looks like this:
+-----------------+--------------------+--------------------+--------------------+--------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|XXXXX01|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX02|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|
|XXXXX03|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX04|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX05|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX06|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|
|XXXXX07|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX08|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
If we look at the complete data of a single row, it looks like this:
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812},{"TIME":1569560483000,"VALUE":1.7812},{"TIME":1569560491000,"VALUE":7.7875}]|
[{"TIME":1569560537000,"VALUE":3.7825},{"TIME":1569560481000,"VALUE":9.7825},{"TIME":1569560489000,"VALUE":5.7825},{"TIME":1569560497000,"VALUE":34.7825}]|
[{"TIME":1569560505000,"VALUE":34.7825},{"TIME":1569560513000,"VALUE":9.7825},{"TIME":1569560521000,"VALUE":34.7825},{"TIME":1569560527000,"VALUE":4.7825}]|
[{"TIME":1569560535000,"VALUE":7.7825},{"TIME":1569560479000,"VALUE":35.7825},{"TIME":1569560487000,"VALUE":3.7825}]
I want to convert each time-value pair in each signal column into a new row:
+-------+-------------+----------+-------------+----------+-------------+----------+-------------+----------+
|    NUM|    SIG1 TIME|SIG1 VALUE|    SIG2 TIME|SIG2 VALUE|    SIG3 TIME|SIG3 VALUE|    SIG4 TIME|SIG4 VALUE|
+-------+-------------+----------+-------------+----------+-------------+----------+-------------+----------+
|XXXXX01|1569560531000| 3.7825|1569560531000| 4.7825|1569560531000| 8.7825|1569560531000| 2.7825|
|XXXXX01|1569560531000| 1.7825|1569560531000| 1.7825| null | null |1569560531000| 2.7825|
|XXXXX01|1569560531000| 3.7825|1569560531000| 4.7825|1569560531000| 8.7825|1569560531000| 7.7825|
|XXXXX02|1569560531000| 7.7825|1569560531000| 4.7825|1569560531000| 8.7825|1569560531000| 2.7825|
|XXXXX02|null | null |1569560531000| 5.7825|1569560531000| 7.7825|1569560531000| 5.7825|
|XXXXX02|1569560531000| 3.7825|1569560531000| 4.7825|1569560531000| 8.7825|1569560531000| 2.7825|
|XXXXX02|1569560531000| 5.7825|1569560531000| 7.7825|1569560531000| 9.7825|1569560531000| 2.7825|
Is there any way to transform the base dataset as shown above? Each element in a column should be converted into a new row.
Any clue or help would be greatly appreciated! Thanks in advance.

You can do this with the explode function. It generates a new row for each element of the array, and you can then access the fields TIME and VALUE using dot syntax (accessing the fields of a struct). Here is a simple example for the first column:
data
.withColumn("sig1_obj", explode($"SIG1"))
.withColumn("sig1_time", $"sig1_obj.time")
.withColumn("sig1_value", $"sig1_obj.value")
.show()
+--------------------+--------------------+-------------+----------+
| SIG1| sig1_obj| sig1_time|sig1_value|
+--------------------+--------------------+-------------+----------+
|[[1569560531000, ...|[1569560531000, 3...|1569560531000| 3.7825|
|[[1569560531000, ...|[1569560475000, 3...|1569560475000| 3.7812|
|[[1569560531000, ...|[1569560483000, 1...|1569560483000| 1.7812|
|[[1569560531000, ...|[1569560491000, 7...|1569560491000| 7.7875|
+--------------------+--------------------+-------------+----------+
Similarly, you can process the other columns.

Also note that this technique multiplies the data: after exploding the second column you get n*m rows, where n is the number of elements in the SIG1 array, m is the number of elements in the SIG2 array, and so on. If you don't want that, you can explode each column in a separate dataframe and then full-outer-join those dataframes on some fields (for example, number the rows within each NUM and join on the NUM column plus the row number).
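The join-based alternative can be sketched as follows (a sketch under the assumption that the signal columns have already been parsed into arrays of structs; explodeSignal and the sample data are hypothetical, and a spark-shell session is assumed). posexplode yields the per-array position directly, so it can serve as the row number to join on:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical sample: arrays of (TIME, VALUE) tuples of different lengths
val data = Seq(
  ("XXXXX01",
    Seq((1569560531000L, 3.7825), (1569560475000L, 3.7812)),
    Seq((1569560537000L, 3.7825), (1569560481000L, 9.7825), (1569560489000L, 5.7825)))
).toDF("NUM", "SIG1", "SIG2")

// Explode one signal column into (NUM, pos, time, value);
// posexplode produces the generator columns "pos" and "col"
def explodeSignal(df: DataFrame, sig: String): DataFrame =
  df.select($"NUM", posexplode(col(sig)))
    .select($"NUM", $"pos",
      $"col._1".as(s"$sig TIME"),
      $"col._2".as(s"$sig VALUE"))

// Full outer join on NUM and position: shorter arrays yield nulls
val wide = explodeSignal(data, "SIG1")
  .join(explodeSignal(data, "SIG2"), Seq("NUM", "pos"), "full_outer")
  .drop("pos")

wide.show(false)
```

With four signals you would chain two more joins the same way; this avoids the n*m blow-up of exploding several columns in one dataframe.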
EDIT:

Since the sig columns are of StringType, you first have to convert the string field into an array of structs using the from_json function. In your example it can be done as follows:
import org.apache.spark.sql.types.{StructField, StructType, ArrayType, StringType}
val schema = ArrayType(StructType(Seq(StructField("TIME", StringType), StructField("VALUE", StringType))))
df.withColumn("sig1_arr", from_json($"SIG1", schema))
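Putting the two steps together on a made-up one-row sample (again assuming a spark-shell session), from_json followed by explode looks like this:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Schema for the array-of-structs JSON string
val schema = ArrayType(StructType(Seq(
  StructField("TIME", StringType),
  StructField("VALUE", StringType))))

// Made-up one-row sample standing in for the SIG1 string column
val df = Seq(
  ("XXXXX01",
   """[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812}]""")
).toDF("NUM", "SIG1")

val exploded = df
  .withColumn("sig1_arr", from_json($"SIG1", schema))   // string -> array of structs
  .withColumn("sig1_obj", explode($"sig1_arr"))         // one row per array element
  .select($"NUM", $"sig1_obj.TIME".as("sig1_time"), $"sig1_obj.VALUE".as("sig1_value"))

exploded.show(false)
```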
Yes, I tried the explode function too, but it failed because of a type mismatch: the signal columns are not of array or map type.

scala> SIGNALSFiltered.printSchema
root
 |-- NUM: string (nullable = true)
 |-- SIG1: string (nullable = true)
 |-- SIG2: string (nullable = true)
 |-- SIG3: string (nullable = true)
 |-- SIG4: string (nullable = true)

@Antony OK, but if the string has the form of an array of structs, you can convert it with the from_json function and a provided schema. See the updated answer.

@DavidVrba This solution did not work, because the columns I have to convert are of String type, as I commented above. Before applying it, the signal columns must be converted from String to Struct/Array.

@Antony Sorry, the schema I specified was wrong; I edited the answer and changed the schema. It works for me now. It is important to use the correct schema, otherwise the result will be null values.
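To illustrate the last point about schemas, here is a made-up one-row example (assuming a spark-shell session). from_json parses only the fields the schema names, so a schema with the wrong field names silently yields nulls instead of an error:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq(("XXXXX01", """[{"TIME":1569560531000,"VALUE":3.7825}]""")).toDF("NUM", "SIG1")

// Correct: field names match the JSON keys
val goodSchema = ArrayType(StructType(Seq(
  StructField("TIME", StringType), StructField("VALUE", StringType))))

// Wrong field names (hypothetical): the structs come back with null fields
val wrongSchema = ArrayType(StructType(Seq(
  StructField("TS", StringType), StructField("VAL", StringType))))

val out = df.select(
  from_json($"SIG1", goodSchema).as("good"),
  from_json($"SIG1", wrongSchema).as("wrong"))

out.show(false)
```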