Apache Spark: the most efficient way to merge timestamp columns in a Spark DataFrame
What is the most efficient way to merge two columns in a Spark DataFrame? I have two columns with the same meaning. Null values in timestamp should be filled from toAppendData_timestamp. When both columns have a value, the values are equal.

I have this:
+--------------------+----------------------+--------+
| timestamp|toAppendData_timestamp| value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...| null| null|
| null| 2016-03-24 22:12:...|0.015625|
| null| 2016-03-19 15:54:...| 5.375|
|2016-03-19 15:55:...| 2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...| null| null|
|2016-03-24 22:11:...| 2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+
I need this:
+--------------------+----------------------+--------+
| timestamp_merged|toAppendData_timestamp| value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...| null| null|
|2016-03-24 22:12:...| 2016-03-24 22:12:...|0.015625|
|2016-03-19 15:54:...| 2016-03-19 15:54:...| 5.375|
|2016-03-19 15:55:...| 2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...| null| null|
|2016-03-24 22:11:...| 2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+
I tried this, but it did not work:
appendedData = appendedData['timestamp'].fillna(appendedData['toAppendData_timestamp'])
The function you are looking for is coalesce, which returns the first non-null value among its arguments. (Your attempt fails because fillna only accepts literal replacement values such as strings or numbers, not another column.) You can import coalesce from pyspark.sql.functions:

from pyspark.sql.functions import coalesce, col

and use it like this:
appendedData.withColumn(
'timestamp_merged',
coalesce(col('timestamp'), col('toAppendData_timestamp'))
)
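Note that withColumn returns a new DataFrame, so assign the result back if you want to keep it. Outside Spark, the per-row behaviour of coalesce is simply "first non-null wins". A minimal pure-Python sketch of the merge above (the first_non_null helper is hypothetical, written here only to illustrate the semantics, and is not part of any library):

```python
def first_non_null(*values):
    # Mimic Spark's coalesce: return the first argument that is not None,
    # or None if every argument is None.
    for v in values:
        if v is not None:
            return v
    return None

# (timestamp, toAppendData_timestamp) pairs, abbreviated from the example above
rows = [
    ("2016-03-24 22:11", None),
    (None, "2016-03-24 22:12"),
    ("2016-03-19 15:55", "2016-03-19 15:55"),
]

timestamp_merged = [first_non_null(ts, appended) for ts, appended in rows]
print(timestamp_merged)
# ['2016-03-24 22:11', '2016-03-24 22:12', '2016-03-19 15:55']
```

In the real Spark job, coalesce does exactly this per row, but as a distributed column expression rather than a Python loop.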