Apache Spark: sum of array elements depending on a value condition
I have a PySpark DataFrame:
id | column
------------------------------
1 | [0.2, 2, 3, 4, 3, 0.5]
------------------------------
2  | [7, 0.3, 0.3, 8, 2]
------------------------------
I want to create 3 columns:

Column 1: the sum of the elements < 2
Column 2: the sum of the elements > 2
Column 3: the sum of the elements == 2 (sometimes I have duplicate values, so I sum them); when there are none, I set the value to null
id | column                  | column<2 | column>2 | column=2
---------------------------------------------------------------------------
1  | [0.2, 2, 3, 4, 3, 0.5]  | [0.7]    | [12]     | null
---------------------------------------------------------------------------
2  | [7, 0.3, 0.3, 8, 2]     | [0.6]    | [15]     | [2]
---------------------------------------------------------------------------
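In plain Python terms, the per-row computation being asked for can be sketched like this (the function name is mine, not part of the question):

```python
def split_sums(values):
    # Sum the elements into three buckets: < 2, > 2, and == 2.
    # A bucket with no matching elements yields None (null in Spark terms).
    below = [v for v in values if v < 2]
    above = [v for v in values if v > 2]
    equal = [v for v in values if v == 2]
    total = lambda xs: round(sum(xs), 2) if xs else None
    return total(below), total(above), total(equal)

print(split_sums([0.2, 2, 3, 4, 3, 0.5]))  # (0.7, 10, 2)
print(split_sums([7, 0.3, 0.3, 8, 2]))     # (0.6, 15, 2)
```

Note that the sums printed here are what the sample data actually gives; the expected-output table above lists [12] and null for id 1, which doesn't match the sample array (the answers below comment on this).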
Can you help me? Thank you.

Here is one way you can try:
import pyspark.sql.functions as F
from pyspark.sql import Row

# using map, filter the list and sum based on each condition
s = (df
     .select('column')
     .rdd
     .map(lambda x: [[i for i in x.column if i < 2],
                     [i for i in x.column if i > 2],
                     [i for i in x.column if i == 2]])
     .map(lambda x: [Row(round(sum(i), 2)) for i in x])
     .toDF(['col<2', 'col>2', 'col=2']))

# create a dummy id so we can join both data frames
df = df.withColumn('mid', F.monotonically_increasing_id())
s = s.withColumn('mid', F.monotonically_increasing_id())

# simple join on the dummy id
df = df.join(s, on='mid').drop('mid')
df.show()
+---+--------------------+-----+------+-----+
| id| column|col<2| col>2|col=2|
+---+--------------------+-----+------+-----+
| 0|[0.2, 2.0, 3.0, 4...|[0.7]|[10.0]|[2.0]|
| 1|[7.0, 0.3, 0.3, 8...|[0.6]|[15.0]|[2.0]|
+---+--------------------+-----+------+-----+
For Spark 2.4+, you can use the aggregate and filter higher-order functions, as follows:
from pyspark.sql.functions import expr

df.withColumn("column<2", expr("aggregate(filter(column, x -> x < 2), 0D, (x, acc) -> acc + x)")) \
  .withColumn("column>2", expr("aggregate(filter(column, x -> x > 2), 0D, (x, acc) -> acc + x)")) \
  .withColumn("column=2", expr("aggregate(filter(column, x -> x == 2), 0D, (x, acc) -> acc + x)")) \
  .show(truncate=False)
which gives:
+---+------------------------------+--------+--------+--------+
|id |column |column<2|column>2|column=2|
+---+------------------------------+--------+--------+--------+
|1 |[0.2, 2.0, 3.0, 4.0, 3.0, 0.5]|0.7 |10.0 |2.0 |
|2 |[7.0, 0.3, 0.3, 8.0, 2.0] |0.6 |15.0 |2.0 |
+---+------------------------------+--------+--------+--------+
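The aggregate(filter(...), 0D, (x, acc) -> acc + x) pattern is a left fold over the filtered array. The same fold, mimicked in plain Python to illustrate the semantics (this is a sketch, not the Spark API):

```python
from functools import reduce

def spark_aggregate(arr, zero, merge):
    # mirrors Spark SQL's aggregate(expr, zero, merge): fold arr with
    # merge, starting from the zero value
    return reduce(merge, arr, zero)

col = [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]
gt2 = spark_aggregate([x for x in col if x > 2], 0.0, lambda acc, x: acc + x)
print(gt2)  # 10.0
```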
For Spark 2.4+, you can also do the computation in a single pass using the aggregate function:
from pyspark.sql.functions import expr
# I adjusted the 2nd array-item in id=1 from 2.0 to 2.1 so there is no `2.0` when id=1
df = spark.createDataFrame([(1,[0.2, 2.1, 3., 4., 3., 0.5]),(2,[7., 0.3, 0.3, 8., 2.,])],['id','column'])
df.withColumn('data', expr("""
aggregate(
/* ArrayType argument */
column,
/* zero: set empty array to initialize acc */
array(),
/* merge: iterate through `column` and reduce based on the values of y and the array indices of acc */
(acc, y) ->
CASE
WHEN y < 2.0 THEN array(IFNULL(acc[0],0) + y, acc[1], acc[2])
WHEN y > 2.0 THEN array(acc[0], IFNULL(acc[1],0) + y, acc[2])
ELSE array(acc[0], acc[1], IFNULL(acc[2],0) + y)
END,
/* finish: to convert the array into a named_struct */
acc -> (acc[0] as `column<2`, acc[1] as `column>2`, acc[2] as `column=2`)
)
""")).selectExpr('id', 'data.*').show()
#+---+--------+--------+--------+
#| id|column<2|column>2|column=2|
#+---+--------+--------+--------+
#| 1| 0.7| 12.1| null|
#| 2| 0.6| 15.0| 2.0|
#+---+--------+--------+--------+
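The CASE-based merge above routes each element into one of three accumulator slots, leaving a slot null until its first match. The same reduction mimicked in plain Python, with None standing in for null (a sketch of the semantics only):

```python
from functools import reduce

def merge(acc, y):
    # route y into the (<2, >2, ==2) slot, initializing the slot
    # from None on its first match, as IFNULL(acc[i], 0) does
    below, above, equal = acc
    if y < 2.0:
        return ((below or 0) + y, above, equal)
    elif y > 2.0:
        return (below, (above or 0) + y, equal)
    return (below, above, (equal or 0) + y)

acc = reduce(merge, [0.2, 2.1, 3.0, 4.0, 3.0, 0.5], (None, None, None))
print(tuple(round(v, 2) if v is not None else None for v in acc))  # (0.7, 12.1, None)
```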
Nice use of aggregate in just one iteration (+1). You can use array(0D, 0D, 0D) as the zero value to avoid checking for nulls.

@jxc it works very well, thank you for the very nice solution; it arrived a bit late though, as the other answers were posted earlier, thanks :)

@blackishop, you shouldn't use array(0D, 0D, 0D) to initialize the zero value if you want a bucket with no matching elements to come out as null.
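The trade-off discussed in the comments can be seen by changing the zero value of the fold: starting each slot at 0D avoids the IFNULL checks, but an empty bucket then reads 0.0 instead of null. In plain Python terms (a sketch):

```python
from functools import reduce

def merge(acc, y):
    # slots start at 0.0, so no null handling is needed ...
    below, above, equal = acc
    if y < 2.0:
        return (below + y, above, equal)
    elif y > 2.0:
        return (below, above + y, equal)
    return (below, above, equal + y)

# ... but with no element equal to 2, the ==2 bucket reads 0.0, not None/null
acc = reduce(merge, [0.2, 2.1, 3.0], (0.0, 0.0, 0.0))
print(tuple(round(v, 2) for v in acc))  # (0.2, 5.1, 0.0)
```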