PySpark: How to code a complicated DataFrame calculation


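The dataframe is already sorted by date.

The row with col1 == 1 is unique.

After the col1 == 1 row, col1 increases in steps of 1 (e.g. 1, 2, 3, 4, 5, 6, 7, ...); only -1 is repeated.

I have a dataframe that looks like this; call it df: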

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
TEST_data = [('2020-08-01',-1,-1),('2020-08-02',-1,-1),('2020-08-03',-1,3),('2020-08-04',-1,2),('2020-08-05',1,4),
             ('2020-08-06',2,1),('2020-08-07',3,2),('2020-08-08',4,3),('2020-08-09',5,-1)]
TEST_df = spark.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()



+----------+----+----+
|      date|col1|col2|
+----------+----+----+
|2020-08-01|  -1|  -1|
|2020-08-02|  -1|  -1|
|2020-08-03|  -1|   3|
|2020-08-04|  -1|   2|
|2020-08-05|   1|   4|
|2020-08-06|   2|   1|
|2020-08-07|   3|   2|
|2020-08-08|   4|   3|
|2020-08-09|   5|  -1|
+----------+----+----+

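The condition is: at the row where col1 == 1 (here col2 == 4), start from that col2 value and count upward going backwards in date (e.g. 4, 5, 6, 7, 8, ...); every row after that one gets 0 (e.g. 4, 0, 0, 0, ...).

So my resulting df will look something like this: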

+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|  -1|  -1|   8|
|2020-08-02|  -1|  -1|   7|
|2020-08-03|  -1|   3|   6|
|2020-08-04|  -1|   2|   5|
|2020-08-05|   1|   4|   4|
|2020-08-06|   2|   1|   0|
|2020-08-07|   3|   2|   0|
|2020-08-08|   4|   3|   0|
|2020-08-09|   5|  -1|   0|
+----------+----+----+----+

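Enhancement: I want to add an additional condition for the case where col2 == -1 on the col1 == 1 row (2020-08-05) and col2 == -1 continues on consecutive rows. In that case I want to count the consecutive -1 values and add that count to the col2 value at the row where the run breaks. Here is an example to clarify: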

+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|  -1|  -1|  11|
|2020-08-02|  -1|  -1|  10|
|2020-08-03|  -1|   3|   9|
|2020-08-04|  -1|   2|   8|
|2020-08-05|   1|  -1|  7*|
|2020-08-06|   2|  -1|   0|
|2020-08-07|   3|  -1|   0|
|2020-08-08|   4|  4*|   0|
|2020-08-09|   5|  -1|   0|
+----------+----+----+----+

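So we see 3 consecutive -1 values starting from 2020-08-05 (we only care about the first run of consecutive -1 values), and after the run we have 4 (at 2020-08-08, marked with *), so the col1 == 1 row gets 4 + 3 = 7. Is this possible?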
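The arithmetic in this example can be checked with a small plain-Python sketch (no Spark involved). The helper name `anchor_value` and the list-of-tuples layout are assumptions made purely for illustration: it walks forward in date order from the col1 == 1 row, counts the leading run of -1 in col2, and adds that count to the value that ends the run.

```python
# Rows of the enhanced example as (date, col1, col2), already sorted by date.
rows = [
    ("2020-08-01", -1, -1), ("2020-08-02", -1, -1), ("2020-08-03", -1, 3),
    ("2020-08-04", -1, 2), ("2020-08-05", 1, -1), ("2020-08-06", 2, -1),
    ("2020-08-07", 3, -1), ("2020-08-08", 4, 4), ("2020-08-09", 5, -1),
]


def anchor_value(rows):
    """Return the value wanted on the col1 == 1 row.

    Hypothetical helper: counts the consecutive col2 == -1 rows starting at
    the col1 == 1 row, then adds the col2 value that ends the run.
    Returns None if the run never ends (that case is left undecided here).
    """
    start = next(i for i, (_, col1, _) in enumerate(rows) if col1 == 1)
    run = 0
    for _, _, col2 in rows[start:]:
        if col2 != -1:
            return col2 + run
        run += 1
    return None


print(anchor_value(rows))  # three -1 values, then 4, so 4 + 3 = 7
```

When col2 on the col1 == 1 row is not -1, the run length is 0 and the helper simply returns that col2 value, which matches the first example.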

**My first attempt**

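from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank, sum, when  # this `sum` shadows the Python built-in
# Running sum within each col1 value, taken in date order so the result is deterministic
TEST_df = TEST_df.withColumn('cumsum', sum(when(col('col1') < 1, col('col1'))
                 .otherwise(when(col('col1') == 1, 1).otherwise(0))).over(Window.partitionBy('col1').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow)))
TEST_df.show()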

+----------+----+----+------+
|      date|col1|col2|cumsum|
+----------+----+----+------+
|2020-08-01|  -1|  -1|    -1|
|2020-08-02|  -1|  -1|    -2|
|2020-08-03|  -1|   3|    -3|
|2020-08-04|  -1|   2|    -4|
|2020-08-05|   1|   4|     1|
|2020-08-07|   3|   2|     0|
|2020-08-09|   5|  -1|     0|
|2020-08-08|   4|   3|     0|
|2020-08-06|   2|   1|     0|
+----------+----+----+------+

w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('case').orderBy(desc('cumsum'))

TEST_df.withColumn('case', sum(when((col('cumsum') == 1) & (col('col2') != -1), col('col2'))
       .otherwise(0)).over(w1)) \
  .withColumn('rank', when(col('case') != 0, rank().over(w2) - 1).otherwise(0)) \
  .withColumn('want', col('case') + col('rank')) \
  .orderBy('date') \
  .show(truncate=False)
+----------+----+----+------+----+----+----+
|date      |col1|col2|cumsum|case|rank|want|
+----------+----+----+------+----+----+----+
|2020-08-01|-1  |-1  |-1    |4   |1   |5   |
|2020-08-02|-1  |-1  |-2    |4   |2   |6   |
|2020-08-03|-1  |3   |-3    |4   |3   |7   |
|2020-08-04|-1  |2   |-4    |4   |4   |8   |
|2020-08-05|1   |4   |1     |4   |0   |4   |
|2020-08-06|2   |1   |0     |0   |0   |0   |
|2020-08-07|3   |2   |0     |0   |0   |0   |
|2020-08-08|4   |3   |0     |0   |0   |0   |
|2020-08-09|5   |-1  |0     |0   |0   |0   |
+----------+----+----+------+----+----+----+

You can see the rank is 1, 2, 3, 4; if I could make it 4, 3, 2, 1 it would look like my result dataframe. How can I reverse it? I tried orderBy with both asc and desc...
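One way to look at the reversal problem, sketched in plain Python rather than Spark: inside the case == 4 partition, ordering by cumsum places the col1 == 1 row (cumsum 1) at one end of the -1 rows (cumsum -1 to -4), so neither direction produces 8, 7, 6, 5, 4. Ordering by date descending does. The helpers `rank_minus_one` and `want_by_date` are hypothetical and only mimic `rank() - 1` for distinct keys; in PySpark the corresponding change would presumably be `w2 = Window.partitionBy('case').orderBy(desc('date'))`, which is an untested assumption here.

```python
# The five rows of the case == 4 partition as (date, cumsum).
partition = [
    ("2020-08-01", -1), ("2020-08-02", -2), ("2020-08-03", -3),
    ("2020-08-04", -4), ("2020-08-05", 1),
]
CASE = 4


def rank_minus_one(rows, key, descending):
    """Mimic rank() - 1 over one partition; keys are assumed distinct."""
    ordered = sorted(rows, key=key, reverse=descending)
    return {row[0]: position for position, row in enumerate(ordered)}


def want_by_date(ranks):
    """want = case + rank, listed in ascending date order."""
    return [CASE + ranks[date] for date in sorted(ranks)]


by_cumsum_desc = rank_minus_one(partition, key=lambda r: r[1], descending=True)
by_cumsum_asc = rank_minus_one(partition, key=lambda r: r[1], descending=False)
by_date_desc = rank_minus_one(partition, key=lambda r: r[0], descending=True)

print(want_by_date(by_cumsum_desc))  # [5, 6, 7, 8, 4]  (the attempt's output)
print(want_by_date(by_cumsum_asc))   # [7, 6, 5, 4, 8]
print(want_by_date(by_date_desc))    # [8, 7, 6, 5, 4]  (the desired column)
```

This only covers the case before the enhancement, where col2 on the col1 == 1 row is not -1.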

Of course, the attempt above is from before the enhancement.

IIUC, you can try the following:

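1. groupby and create a collect_list of all related rows (`vals` in the code below), then sort the list by date in descending order. (Note: change `groupby(lit(1))` to whatever columns can be used to divide your data into independent subsets.)
2. Find the array index `idx` of the row that has `col1 == 1`.
3. If `col2 == -1` at `idx`, find the offset from `idx` towards the beginning of the list to the first row having `col2 != -1`. (Note: in the current code, the offset may be NULL if all col2 values before `idx` are -1; you will have to decide what you want in that case, for example use `coalesce(IF(...), 0)`.)
4. Once we have `offset` and `idx`, the `want` column can be calculated by:

    IF(i<idx, 0, vals[idx-offset].col2 + offset + i - idx)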
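
The steps above can be traced in plain Python, which may help when checking the index arithmetic by hand. This is a sketch under assumptions, not the Spark implementation: the function name `compute_want` and the tuple layout are made up for illustration, the col1 == 1 row is assumed to exist exactly once, and a missing offset is treated as 0 (the `coalesce(..., 0)` choice).

```python
def compute_want(rows):
    """rows: list of (date, col1, col2) tuples; returns {date: want}."""
    # Step 1: collect all rows and sort them by date, descending.
    vals = sorted(rows, key=lambda r: r[0], reverse=True)

    # Step 2: index of the row with col1 == 1 (assumed present and unique).
    idx = next(i for i, r in enumerate(vals) if r[1] == 1)

    # Step 3: if col2 == -1 at idx, walk towards the start of the list
    # (later dates) until a row with col2 != -1 is found; default to 0.
    offset = 0
    if vals[idx][2] == -1:
        offset = next((i for i in range(1, idx + 1) if vals[idx - i][2] != -1), 0)

    # Step 4: rows dated after the col1 == 1 row get 0, the rest count upward.
    base = vals[idx - offset][2] + offset
    return {r[0]: 0 if i < idx else base + i - idx for i, r in enumerate(vals)}


dates = ["2020-08-0%d" % d for d in range(1, 10)]
col1 = [-1, -1, -1, -1, 1, 2, 3, 4, 5]
basic = list(zip(dates, col1, [-1, -1, 3, 2, 4, 1, 2, 3, -1]))
enhanced = list(zip(dates, col1, [-1, -1, 3, 2, -1, -1, -1, 4, -1]))

for data in (basic, enhanced):
    want = compute_want(data)
    print([want[d] for d in dates])
# [8, 7, 6, 5, 4, 0, 0, 0, 0]
# [11, 10, 9, 8, 7, 0, 0, 0, 0]
```

In the enhanced data the sorted list has `idx = 4` and `offset = 3`, so the col1 == 1 row gets `vals[1].col2 + 3 = 7`, matching the expected table.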

Edit: per the comments, added an alternative that uses a Window aggregate function instead of groupby:

from pyspark.sql import Window
from pyspark.sql.functions import collect_list, expr, sort_array, struct

# WindowSpec to cover all related Rows in the same partition
w1 = Window.partitionBy().orderBy('date').rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)

cols = ["date", "col1", "col2"]

# below `cur_idx` is the index for the current Row in array `vals`
df_new = TEST_df.withColumn('vals', sort_array(collect_list(struct(*cols)).over(w1),False)) \
    .withColumn('idx', expr("filter(sequence(0,size(vals)-1), i -> vals[i].col1=1)[0]")) \
    .withColumn('offset', expr("IF(vals[idx].col2=-1, filter(sequence(1,idx), i -> vals[idx-i].col2 != -1)[0],0)")) \
    .withColumn("cur_idx", expr("array_position(vals, struct(date,col1,col2))-1")) \
    .selectExpr(*TEST_df.columns, "IF(cur_idx<idx, 0, vals[idx-offset].col2 + offset + cur_idx - idx) as want")

df_new.orderBy('date').show()
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|  -1|  -1|  11|
|2020-08-02|  -1|  -1|  10|
|2020-08-03|  -1|   3|   9|
|2020-08-04|  -1|   2|   8|
|2020-08-05|   1|  -1|   7|
|2020-08-06|   2|  -1|   0|
|2020-08-07|   3|  -1|   0|
|2020-08-08|   4|   4|   0|
|2020-08-09|   5|  -1|   0|
+----------+----+----+----+

Comments:

What is your Spark version?

Spark version: 2.4.6

Thanks for your hard work, jxc... Regarding "for example use coalesce(IF(...), 0)": where do I put the coalesce condition? Yes, I want it to behave like offset = 0 if all col2 values before idx are -1.

@hellotherebj In theory it should be applied to the output of the filter statement, but the end result should be the same: IF(vals[idx].col2=-1, coalesce(filter(sequence(1,idx), i -> vals[idx-i].col2 != -1)[0], 0), 0)

Thanks, that makes sense. Could you take a look at my new question? It is somewhat similar to this one.

Hey jxc, I posted another challenging problem... I can't seem to do it on my own, so I need your help.

If you want to keep the original method, we have to include all columns in named_struct() inside the inline function. Just use Python format: .selectExpr("""inline(transform(vals, (x,i) -> named_struct({}, 'want', IF(i…
