Python Apache Spark: counting state changes
I'm new to Apache Spark (PySpark) and would appreciate some help with this problem. I'm currently using PySpark 1.6 (I had to drop 2.0 because it does not support MQTT). I have a DataFrame that contains the following data:
+----------+-----------+
|      time|door_status|
+----------+-----------+
|1473678846|          2|
|1473678852|          1|
|1473679029|          3|
|1473679039|          3|
|1473679045|          2|
|1473679055|          1|
+----------+-----------+
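This is basically the time and the status of a door. I need to count how many times the door was opened and closed. So I need to identify the state changes and keep an independent counter for each status.
Since I'm new to this, I find it hard to picture how to implement it. It would be great if someone could suggest an idea and point me in the right direction.
Thanks in advance.
In this case, using accumulators should solve the problem. Basically, you create three different accumulators, one for each of the three statuses: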
status_1 = sc.accumulator(0)
status_2 = sc.accumulator(0)
status_3 = sc.accumulator(0)
Then you can do something like the following:
if status == 1:
    status_1 += 1
There is no efficient way to perform this kind of operation out of the box. You can use window functions:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag

df = sc.parallelize([
    (1473678846, 2), (1473678852, 1),
    (1473679029, 3), (1473679039, 3),
    (1473679045, 2), (1473679055, 1)
]).toDF(["time", "door_status"])

w = Window().orderBy("time")

(df
    .withColumn("prev_status", lag("door_status", 1).over(w))
    .where(col("door_status") != col("prev_status"))
    .groupBy("door_status", "prev_status")
    .count())
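To make it easier to see what the window query computes, here is a minimal plain-Python sketch of the same logic, run on the sample rows from the question without Spark. The helper name `transition_counts` is hypothetical, and shifting the list by one position only approximates `lag`; it is not the Spark implementation.

```python
from collections import Counter

# Sample rows from the question: (time, door_status)
rows = [
    (1473678846, 2), (1473678852, 1),
    (1473679029, 3), (1473679039, 3),
    (1473679045, 2), (1473679055, 1),
]

def transition_counts(rows):
    """Count (door_status, prev_status) pairs where the status changed."""
    ordered = sorted(rows)                 # plays the role of orderBy("time")
    statuses = [status for _, status in ordered]
    prev = [None] + statuses[:-1]          # plays the role of lag("door_status", 1)
    return Counter(
        (cur, p)
        for cur, p in zip(statuses, prev)
        # The first row has no previous status and is skipped, just as the
        # null comparison drops it in the Spark query.
        if p is not None and cur != p
    )

counts = transition_counts(rows)
print(counts)
```

On this sample, the door goes from status 2 to status 1 twice, and from 1 to 3 and from 3 to 2 once each. The repeated 3 is not counted as a change.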
But this simply won't scale. You can try mapPartitions instead. First, let's define a function that will be used to map partitions:
from collections import Counter

def process_partition(iterator):
    """Given an iterator of (time, state)
    return the first state, the last state and
    a counter of state changes

    >>> process_partition([])
    [(None, None, Counter())]
    >>> process_partition(enumerate([1, 1, 1]))
    [(1, 1, Counter())]
    >>> process_partition(enumerate([1, 2, 3]))
    [(1, 3, Counter({(1, 2): 1, (2, 3): 1}))]
    """
    first = None
    prev = None
    cnt = Counter()

    for i, (_, x) in enumerate(iterator):
        # Store the first object per partition
        if i == 0:
            first = x

        # If the state changed, update the counter
        if prev is not None and prev != x:
            cnt[(prev, x)] += 1

        prev = x

    return [(first, prev, cnt)]
And a simple merge function:
def merge(x, y):
    """Given a pair of tuples:
    (first-state, last-state, counter_of_changes)
    return a tuple of the same shape representing aggregated results

    >>> merge((None, None, Counter()), (1, 1, Counter()))
    (None, 1, Counter())
    >>> merge((1, 2, Counter([(1, 2)])), (2, 2, Counter()))
    (None, 2, Counter({(1, 2): 1}))
    >>> merge((1, 2, Counter([(1, 2)])), (3, 2, Counter([(3, 2)])))
    (None, 2, Counter({(1, 2): 1, (2, 3): 1, (3, 2): 1}))
    """
    (_, last_x, cnt_x), (first_y, last_y, cnt_y) = x, y

    # If neither side is empty and the state differs across the boundary,
    # count the change
    if last_x is not None and first_y is not None and last_x != first_y:
        cnt_y[(last_x, first_y)] += 1

    # Merge counters
    cnt_y.update(cnt_x)

    # If y came from an empty partition, carry the last state of x forward
    last = last_y if last_y is not None else last_x
    return (None, last, cnt_y)
With these two we can do the following:
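from functools import reduce  # built in on Python 2, needs this import on Python 3

# Assumes the rows of df are already ordered by time across partitions
partials = (df.rdd
    .mapPartitions(process_partition)
    .collect())

reduce(merge, [(None, None, Counter())] + partials)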
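The partition-based approach can be checked without a cluster by treating plain Python lists as partitions. The following is a minimal sketch, assuming the rows are already ordered by time across partitions. `count_changes` and the `layouts` list are hypothetical names introduced here, and `merge` is a non-mutating variant that also carries the last state across an empty partition. Note that the keys here are (previous, current), the reverse of the column order in the window query.

```python
from collections import Counter
from functools import reduce

def process_partition(rows):
    """Return [(first state, last state, Counter of changes)] for one partition."""
    first = prev = None
    cnt = Counter()
    for i, (_, state) in enumerate(rows):
        if i == 0:
            first = state
        if prev is not None and prev != state:
            cnt[(prev, state)] += 1
        prev = state
    return [(first, prev, cnt)]

def merge(x, y):
    """Combine two partial results without mutating either input."""
    (_, last_x, cnt_x), (first_y, last_y, cnt_y) = x, y
    total = cnt_x + cnt_y
    # Count a change that happens exactly on the partition boundary
    if last_x is not None and first_y is not None and last_x != first_y:
        total[(last_x, first_y)] += 1
    # An empty y must not erase the last state seen so far
    return (None, last_y if last_y is not None else last_x, total)

def count_changes(partitions):
    """Simulate mapPartitions + collect + reduce on a list of lists."""
    partials = [p for part in partitions for p in process_partition(part)]
    return reduce(merge, partials, (None, None, Counter()))[2]

rows = [
    (1473678846, 2), (1473678852, 1),
    (1473679029, 3), (1473679039, 3),
    (1473679045, 2), (1473679055, 1),
]

# Different partition layouts of the same ordered data
layouts = [
    [rows],
    [rows[:2], rows[2:]],
    [rows[:3], [], rows[3:]],   # includes an empty partition
    [[r] for r in rows],        # one row per partition
]
results = [count_changes(layout) for layout in layouts]
print(results[0])
```

Every layout should give the same counter, which is the property that makes the per-partition results safe to combine on the driver.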
You can try the following solution:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._  // coalesce, lead, sum

val data = Seq((1473678846, 2), (1473678852, 1), (1473679029, 3), (1473679039, 3), (1473679045, 2), (1473679055, 1), (1473779045, 1), (1474679055, 2), (1475679055, 1), (1476679055, 3)).toDF("time", "door_status")

data.
  select(
    $"*",
    // Status of the next row by time; the last row falls back to its own status
    coalesce(lead($"door_status", 1).over(Window.orderBy($"time")), $"door_status").as("next_door_status")
  ).
  groupBy($"door_status").
  agg(
    // Count the rows whose status differs from the next one
    sum(($"door_status" !== $"next_door_status").cast("int")).as("door_changes")
  ).
  show
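When porting the Scala query to Python, it may help to know what numbers to expect. The sketch below reproduces the lead and coalesce logic in plain Python on the same ten rows. It is only a reference calculation under that assumption, not a PySpark translation, and `changes_out_of` is a hypothetical name.

```python
# The ten (time, door_status) rows used in the Scala snippet
data = [
    (1473678846, 2), (1473678852, 1), (1473679029, 3), (1473679039, 3),
    (1473679045, 2), (1473679055, 1), (1473779045, 1), (1474679055, 2),
    (1475679055, 1), (1476679055, 3),
]

def changes_out_of(rows):
    """For each status, count the rows whose next status (by time) differs."""
    statuses = [status for _, status in sorted(rows)]
    # lead(door_status, 1) with a coalesce fallback: the last row is compared
    # with itself, so it never counts as a change
    next_statuses = statuses[1:] + statuses[-1:]
    result = {status: 0 for status in statuses}   # keep statuses with zero changes
    for cur, nxt in zip(statuses, next_statuses):
        if cur != nxt:
            result[cur] += 1
    return result

door_changes = changes_out_of(data)
print(door_changes)
```

Each value is the number of times the door left that status, which is what the `door_changes` column in the Scala query is grouped by.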
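It is written in Scala, so you will have to write the same program in Python.
I tried this in Java, but the DataFrame API can be used in much the same way from Python. Do the following:
- Load the data as a DataFrame/Dataset
- Register the DataFrame as a temporary table
- Run this query: select status, count(*) from door_status group by status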
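Make sure to remove the header.
@zero323 Thanks. The first approach worked. As you mentioned, it does appear to be slow for large amounts of data. I haven't tried the second approach yet.
@DavidArenburg I'm not sure I follow. Could you elaborate?
@DavidArenburg No, of course not. I use it to run a test suite with different partition layouts.
Thanks a lot.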