How do I expand a time range into per-minute intervals in Spark (Scala or Python)?
I have a dataset with the following structure:
+-------+----------+---------------+---------------+
| tv_id | movie_id | start_time    | end_time      |
+-------+----------+---------------+---------------+
| tv123 | movie123 | 02/05/19 3:05 | 02/05/19 3:08 |
| tv234 | movie345 | 02/05/19 3:07 | 02/05/19 3:10 |
+-------+----------+---------------+---------------+
The output I am trying to get looks like this:
+-------+----------+---------------+
| tv_id | movie_id | minute        |
+-------+----------+---------------+
| tv123 | movie123 | 02/05/19 3:05 |
| tv123 | movie123 | 02/05/19 3:06 |
| tv123 | movie123 | 02/05/19 3:07 |
| tv234 | movie345 | 02/05/19 3:07 |
| tv234 | movie345 | 02/05/19 3:08 |
| tv234 | movie345 | 02/05/19 3:09 |
+-------+----------+---------------+
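To elaborate:
For tv_id tv123, the total viewing time is 3 minutes (3:08 - 3:05), so the row expands into one row per minute: 3:05, 3:06, and 3:07. The end time itself is not included.
The same applies to the other records.
I am trying to get this result using Python, Scala, or SQL. [There is no restriction on the language used.]
My Python code: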
df = read_csv('data')
df['minutes_diff'] = df['end_time'] - df['start_time']
for i in range(df['minutes_diff']):
    finaldf = df['tv_id'] + df['movie_id'] + df['start_time'] + df['minutes_diff'] + "i"
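I don't know how to proceed from here.
I am not very familiar with Scala's flatMap. Some research on StackOverflow pointed to using flatMap, but I am not sure how to use the time difference inside flatMap rather than in an aggregation.
Note: I did not want to open separate threads for SQL and Python, so I am combining both in this one question.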
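Even a SQL solution would be perfectly fine for me.

Here is a Scala-based solution that uses a UDF to expand the time range into a per-minute list via the java.time API, and then flattens that list with Spark's built-in explode method: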
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and $"col"; assumes a SparkSession named `spark` (automatic in spark-shell)

val df = Seq(
  ("tv123", "movie123", "02/05/19 3:05", "02/05/19 3:08"),
  ("tv234", "movie345", "02/05/19 3:07", "02/05/19 3:10")
).toDF("tv_id", "movie_id", "start_time", "end_time")

// UDF that lists every minute from timeS1 (inclusive) up to timeS2 (exclusive)
def minuteList(timePattern: String) = udf{ (timeS1: String, timeS2: String) =>
  import java.time.LocalDateTime
  import java.time.format.DateTimeFormatter

  val timeFormat = DateTimeFormatter.ofPattern(timePattern)
  val t1 = LocalDateTime.parse(timeS1, timeFormat)
  val t2 = LocalDateTime.parse(timeS2, timeFormat)

  Iterator.iterate(t1)(_.plusMinutes(1)).takeWhile(_ isBefore t2).
    map(_.format(timeFormat)).
    toList
}

df.
  withColumn("minute_list", minuteList("MM/dd/yy H:mm")($"start_time", $"end_time")).
  withColumn("minute", explode($"minute_list")).  // one output row per list element
  select("tv_id", "movie_id", "minute").
  show(false)
// +-----+--------+-------------+
// |tv_id|movie_id|minute       |
// +-----+--------+-------------+
// |tv123|movie123|02/05/19 3:05|
// |tv123|movie123|02/05/19 3:06|
// |tv123|movie123|02/05/19 3:07|
// |tv234|movie345|02/05/19 3:07|
// |tv234|movie345|02/05/19 3:08|
// |tv234|movie345|02/05/19 3:09|
// +-----+--------+-------------+
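Since the question also asks about Python, the same expand-then-flatten logic can be sketched in plain Python without Spark. This is only an illustration under some assumptions: the names `minute_list`, `format_minute`, and `expand_rows` are hypothetical helpers, the rows are plain tuples rather than a DataFrame, and the strptime pattern `%m/%d/%y %H:%M` is assumed to correspond to the Java pattern `MM/dd/yy H:mm` used above.

```python
from datetime import datetime, timedelta

# Assumed Python counterpart of the Java pattern "MM/dd/yy H:mm".
TIME_PATTERN = "%m/%d/%y %H:%M"


def format_minute(t):
    # Build the hour by hand so it is not zero-padded ("3:05", not "03:05").
    return f"{t:%m/%d/%y} {t.hour}:{t:%M}"


def minute_list(start, end, pattern=TIME_PATTERN):
    """Return every minute in [start, end) as a formatted string."""
    t = datetime.strptime(start, pattern)
    t_end = datetime.strptime(end, pattern)
    minutes = []
    while t < t_end:  # end is exclusive, like takeWhile(_ isBefore t2)
        minutes.append(format_minute(t))
        t += timedelta(minutes=1)
    return minutes


def expand_rows(rows):
    """Flatten (tv_id, movie_id, start, end) rows into one row per minute."""
    return [
        (tv_id, movie_id, minute)
        for tv_id, movie_id, start, end in rows
        for minute in minute_list(start, end)
    ]


rows = [
    ("tv123", "movie123", "02/05/19 3:05", "02/05/19 3:08"),
    ("tv234", "movie345", "02/05/19 3:07", "02/05/19 3:10"),
]

for row in expand_rows(rows):
    print(row)
```

The nested comprehension in `expand_rows` plays the role of flatMap (or explode): each input row yields a list of minutes, and the lists are concatenated into one flat result. In PySpark, `minute_list` could probably be wrapped in a UDF and combined with `explode` in the same way as the Scala version. Spark 2.4+ also has a built-in `sequence` function that may remove the need for a UDF, although its end bound is inclusive and would need adjusting.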