Apache Spark: sum of array elements depending on a value condition

Tags: apache-spark, pyspark, pyspark-sql, pyspark-dataframes

I have a PySpark dataframe:

id   |   column
------------------------------
1    |  [0.2, 2, 3, 4, 3, 0.5]
------------------------------
2    |  [7, 0.3, 0.3, 8, 2,]
------------------------------
I want to create 3 columns:

  • Column 1: contains the sum of the elements < 2
  • Column 2: contains the sum of the elements > 2
  • Column 3: contains the sum of the elements = 2 (sometimes there are duplicate values, so I sum them); when there is no such element, I set the value to null

Expected result:

id   |   column               |  column<2 |  column>2   | column=2 
---------------------------------------------------------------------------
1    |  [0.2, 2, 3, 4, 3, 0.5]|  [0.7]    |  [12]       |  null
---------------------------------------------------------------------------
2    |  [7, 0.3, 0.3, 8, 2,]  | [0.6]     |  [15]       |  [2]
---------------------------------------------------------------------------
Could you help me?
Thank you.
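
For reference, a minimal sketch of how the input dataframe can be rebuilt for testing (assumptions: the values are doubles and the trailing comma in the second array is a typo):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# rebuild the sample dataframe from the question
df = spark.createDataFrame(
    [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]),
     (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
    ['id', 'column'])
df.show(truncate=False)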

Here is one approach you can try:

import pyspark.sql.functions as F
from pyspark.sql import Row

# use map to filter the list and sum the values for each condition
s = (df
     .select('column')
     .rdd
     .map(lambda x: [[i for i in x.column if i < 2],
                     [i for i in x.column if i > 2],
                     [i for i in x.column if i == 2]])
     .map(lambda x: [Row(round(sum(i), 2)) for i in x])
     .toDF(['col<2', 'col>2', 'col=2']))

# create a dummy id so we can join both data frames
df = df.withColumn('mid', F.monotonically_increasing_id())
s = s.withColumn('mid', F.monotonically_increasing_id())

# join the two data frames on the dummy id
df = df.join(s, on='mid').drop('mid')
df.show()

+---+--------------------+-----+------+-----+
| id|              column|col<2| col>2|col=2|
+---+--------------------+-----+------+-----+
|  0|[0.2, 2.0, 3.0, 4...|[0.7]|[10.0]|[2.0]|
|  1|[7.0, 0.3, 0.3, 8...|[0.6]|[15.0]|[2.0]|
+---+--------------------+-----+------+-----+
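
A possible variation on the same RDD approach (a sketch only; the split_sums helper and the lt2/gt2/eq2 column names are just for illustration): carrying id through the map avoids the dummy-id join and makes it easy to return null when a condition has no matching elements, as asked in the question.

from pyspark.sql import Row

def split_sums(row):
    # split the array by condition; an empty group becomes None instead of 0
    lt = [v for v in row.column if v < 2]
    gt = [v for v in row.column if v > 2]
    eq = [v for v in row.column if v == 2]
    return Row(id=row.id,
               lt2=round(sum(lt), 2) if lt else None,
               gt2=round(sum(gt), 2) if gt else None,
               eq2=round(sum(eq), 2) if eq else None)

s = df.select('id', 'column').rdd.map(split_sums).toDF()
df.join(s, on='id').show()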

For Spark 2.4+, you can use the filter and aggregate higher-order functions, as shown below:

df.withColumn("column<2", expr("aggregate(filter(column, x -> x < 2), 0D, (x, acc) -> acc + x)")) \
  .withColumn("column>2", expr("aggregate(filter(column, x -> x > 2), 0D, (x, acc) -> acc + x)")) \
  .withColumn("column=2", expr("aggregate(filter(column, x -> x == 2), 0D, (x, acc) -> acc + x)")) \
  .show(truncate=False)
Which gives:

+---+------------------------------+--------+--------+--------+
|id |column                        |column<2|column>2|column=2|
+---+------------------------------+--------+--------+--------+
|1  |[0.2, 2.0, 3.0, 4.0, 3.0, 0.5]|0.7     |10.0    |2.0     |
|2  |[7.0, 0.3, 0.3, 8.0, 2.0]     |0.6     |15.0    |2.0     |
+---+------------------------------+--------+--------+--------+
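
If the null-instead-of-zero behaviour from the question matters, one possible tweak (a sketch, still assuming Spark 2.4+; the cond_sum helper is just for illustration) only aggregates when the filtered array is non-empty:

from pyspark.sql.functions import expr

def cond_sum(cond):
    # build the same filter/aggregate expression, but return NULL when
    # no array element satisfies the condition
    filtered = f"filter(column, x -> x {cond})"
    return expr(f"IF(size({filtered}) > 0, "
                f"aggregate({filtered}, 0D, (acc, x) -> acc + x), NULL)")

df.withColumn("column<2", cond_sum("< 2")) \
  .withColumn("column>2", cond_sum("> 2")) \
  .withColumn("column=2", cond_sum("== 2")) \
  .show(truncate=False)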

For Spark 2.4+, you can use the aggregate function to do the calculation in one step:

from pyspark.sql.functions import expr

# I adjusted the 2nd array-item in id=1 from 2.0 to 2.1 so there is no `2.0` when id=1
df = spark.createDataFrame([(1,[0.2, 2.1, 3., 4., 3., 0.5]),(2,[7., 0.3, 0.3, 8., 2.,])],['id','column'])

df.withColumn('data', expr("""

    aggregate(
      /* ArrayType argument */
      column,
      /* zero: set empty array to initialize acc */
      array(),
      /* merge: iterate through `column` and reduce based on the values of y and the array indices of acc */
      (acc, y) ->
        CASE
          WHEN y < 2.0 THEN array(IFNULL(acc[0],0) + y, acc[1], acc[2])
          WHEN y > 2.0 THEN array(acc[0], IFNULL(acc[1],0) + y, acc[2])
                       ELSE array(acc[0], acc[1], IFNULL(acc[2],0) + y)
        END,
      /* finish: to convert the array into a named_struct */
      acc -> (acc[0] as `column<2`, acc[1] as `column>2`, acc[2] as `column=2`)
    )

""")).selectExpr('id', 'data.*').show()
#+---+--------+--------+--------+
#| id|column<2|column>2|column=2|
#+---+--------+--------+--------+
#|  1|     0.7|    12.1|    null|
#|  2|     0.6|    15.0|     2.0|
#+---+--------+--------+--------+
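
As the comments below note, the IFNULL checks can be dropped by initializing the accumulator with array(0D, 0D, 0D). A sketch of that variant (assumption: you accept 0.0 instead of null when a condition has no matching elements):

from pyspark.sql.functions import expr

df.withColumn('data', expr("""

    aggregate(
      column,
      /* zero: pre-filled accumulator, so no IFNULL is needed */
      array(0D, 0D, 0D),
      (acc, y) ->
        CASE
          WHEN y < 2.0 THEN array(acc[0] + y, acc[1], acc[2])
          WHEN y > 2.0 THEN array(acc[0], acc[1] + y, acc[2])
                       ELSE array(acc[0], acc[1], acc[2] + y)
        END,
      /* finish: convert the array into a named_struct */
      acc -> (acc[0] as `column<2`, acc[1] as `column>2`, acc[2] as `column=2`)
    )

""")).selectExpr('id', 'data.*').show()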


Comments:

  • Nice use of aggregate with only one iteration (+1). You could use array(0D, 0D, 0D) as the zero value to avoid the null checks.
  • @jxc it works very well, thank you very much for this very nice solution, but it came a bit late and the other solutions were posted earlier, thank you :)
  • @blackishop, you should not use array(0D, 0D, 0D) to initialize the zero value if you want to get null when a condition has no matching elements.