
Apache Spark: How to scan a column in a PySpark DataFrame to derive a new column


I have a PySpark DataFrame with two columns, sendtime and charge_state. Whenever charge_state changes from 'off' to 'on', a new charging cycle begins.

Now I want to label each charging cycle to get the output below.

Input:

+-------------------+------------+
|           sendtime|charge_state|
+-------------------+------------+
|2018-03-02 08:00:00|          on|
...
|2018-03-02 09:42:32|          on|
|2018-03-02 09:42:33|          on|
|2018-03-02 09:42:34|          on|
|2018-03-02 09:42:35|         off|
|2018-03-02 09:42:36|         off|
...
|2018-03-02 10:11:12|         off|
|2018-03-02 10:11:13|          on|
|2018-03-02 10:11:14|          on|
...

Output:

+-------------------+------------+---------------+
|           sendtime|charge_state|charge_cycle_ID|
+-------------------+------------+---------------+
|2018-03-02 08:00:00|          on|             c1|
...
|2018-03-02 09:42:32|          on|             c1|
|2018-03-02 09:42:33|          on|             c1|
|2018-03-02 09:42:34|          on|             c1|
|2018-03-02 09:42:35|         off|             c1|
|2018-03-02 09:42:36|         off|             c1|
...
|2018-03-02 10:11:12|         off|             c1|
|2018-03-02 10:11:13|          on|             c2|
|2018-03-02 10:11:14|          on|             c2|
...
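
For reproducibility, here is a minimal sketch that rebuilds a few of the sample rows as a DataFrame (the rows are taken from the tables above; the session setup and the variable name df are assumptions, not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A few rows around the first off -> on transition; sendtime is kept as a
# string, which still orders correctly because the format sorts lexicographically.
df = spark.createDataFrame(
    [
        ('2018-03-02 09:42:34', 'on'),
        ('2018-03-02 09:42:35', 'off'),
        ('2018-03-02 09:42:36', 'off'),
        ('2018-03-02 10:11:12', 'off'),
        ('2018-03-02 10:11:13', 'on'),
        ('2018-03-02 10:11:14', 'on'),
    ],
    ['sendtime', 'charge_state'],
)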

You can use window functions for this task:

from pyspark.sql import functions as F
from pyspark.sql import Window

df.withColumn(
    'charge_state_lag',
    # previous row's charge_state, ordered by time
    F.lag('charge_state').over(Window.partitionBy().orderBy('sendtime'))
).withColumn(
    'fg',
    # flag the rows where a new cycle starts (an off -> on transition)
    F.when((F.col('charge_state') == 'on') & (F.col('charge_state_lag') == 'off'), 1).otherwise(0)
).select(
    'sendtime',
    'charge_state',
    # a running sum of the flags numbers the cycles: c1, c2, ...
    F.concat(
        F.lit('c'),
        (F.sum('fg').over(Window.partitionBy().orderBy('sendtime')) + 1).cast('string')
    ).alias('charge_cycle_ID')
).show()
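
To see why this works: F.lag makes the previous row's state available, so fg is 1 exactly on the rows where the state flips from 'off' to 'on' (on the very first row the lag is null, so fg stays 0). The running F.sum('fg') over the time-ordered window is therefore 0 for every row of the first cycle and increases by 1 at each new cycle start; adding 1 and prefixing 'c' yields the labels c1, c2, and so on.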

Comments:

- Your requirement implies that the data has to be processed on a single executor. That works, but the data may not fit on any single node of the cluster. Window functions are a good way to solve this.
- Thanks a lot, Steven! My case is more complicated than this; I'll give it a try and come back.
- @Mr.Young how about this solution? If it works, consider accepting the answer.
- It works! Window functions are the only way to perform this task on a DataFrame. I am considering fitting the data into streaming, because my computation is hard to express with window functions alone.
- If the downvoter could say what is wrong... that would be helpful, as my reputation is low.
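
As the first comment notes, Window.partitionBy() with no keys pulls the whole dataset into a single partition on one executor, which does not scale. A minimal sketch of one way around that, assuming a hypothetical device_id column (not in the original question) so that cycles can be numbered independently per device:

from pyspark.sql import functions as F
from pyspark.sql import Window

# 'device_id' is an assumed grouping column; each window now operates on one
# device's rows only, so Spark can spread the work across the cluster.
w = Window.partitionBy('device_id').orderBy('sendtime')

labeled = df.withColumn(
    'charge_state_lag', F.lag('charge_state').over(w)
).withColumn(
    'fg',
    F.when((F.col('charge_state') == 'on') & (F.col('charge_state_lag') == 'off'), 1).otherwise(0)
).withColumn(
    'charge_cycle_ID',
    F.concat(F.lit('c'), (F.sum('fg').over(w) + 1).cast('string'))
)

Note that the resulting IDs are only unique within a device; prepend device_id if globally unique labels are needed.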