How do I use a for loop with PySpark?


I am trying to check whether several column values are 0 inside a when/otherwise condition. We have a Spark DataFrame whose columns are named 1 through 11, and I need to check their values. My current code looks like this:

df3 = df3.withColumn('Status',
                     when((col("1") == 0) | (col("2") == 0) | (col("3") == 0) | (col("4") == 0) |
                          (col("5") == 0) | (col("6") == 0) | (col("7") == 0) | (col("8") == 0) |
                          (col("9") == 0) | (col("10") == 0) | (col("11") == 0),
                          'Incomplete').otherwise('Complete'))

How can I do this with a for loop instead of writing out so many conditions?

You can collect the conditions, join them into a single string, and then call eval on the result. There may be a better solution.
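The code for that eval step is not shown above, so here is a minimal sketch of what it might look like, reusing df3 and the column names "1" through "11" from the question (the details are an assumption, not the original answer's code):

from pyspark.sql.functions import col, when

# Build the OR of all "column == 0" tests as a string, then evaluate it
# into a single Column expression with eval().
cols = [str(i) for i in range(1, 12)]                 # "1" .. "11"
cond_str = " | ".join('(col("{}") == 0)'.format(c) for c in cols)
df3 = df3.withColumn('Status', when(eval(cond_str), 'Incomplete').otherwise('Complete'))

The example below takes a different route: it pulls the values out of each row with rdd.map and checks them with a plain Python status() function.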

>>> df = spark.createDataFrame([(1,0,0,2),(1,1,1,1)],['c1','c2','c3','c4'])
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  1|  0|  0|  2|
|  1|  1|  1|  1|
+---+---+---+---+

def status(x):
  # A Row is iterable, so collect its values and flag the row as
  # 'Incomplete' if any of them is 0.
  l = [i for i in x]
  if 0 in l:
    return 'Incomplete'
  else:
    return 'Complete'

>>> df.rdd.map(lambda x:  (x.c1, x.c2, x.c3, x.c4,status(x))).toDF(['c1','c2','c3','c4','status']).show()
+---+---+---+---+----------+
| c1| c2| c3| c4|    status|
+---+---+---+---+----------+
|  1|  0|  0|  2|Incomplete|
|  1|  1|  1|  1|  Complete|
+---+---+---+---+----------+
Here is a more Pythonic solution, using functools.reduce together with operator.or_:

import operator
import functools
from pyspark.sql import functions as f

colnames = [str(i+1) for i in range(11)]
df1 = spark.sparkContext.parallelize([
  [it for it in range(11)],
  [it for it in range(1, 12)]]
).toDF(colnames)

df1.show()
+---+---+---+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|
+---+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|
+---+---+---+---+---+---+---+---+---+---+---+

# OR together one "column == 0" test per column into a single condition.
cond_expr = functools.reduce(operator.or_, [(f.col(c) == 0) for c in df1.columns])

df1.withColumn('test', f.when(cond_expr, f.lit('Incomplete')).otherwise('Complete')).show()
+---+---+---+---+---+---+---+---+---+---+---+----------+
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|      test|
+---+---+---+---+---+---+---+---+---+---+---+----------+
|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|Incomplete|
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|  Complete|
+---+---+---+---+---+---+---+---+---+---+---+----------+


This way you do not need to define any function, evaluate string expressions, or use Python lambdas. Hope this helps.
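
Since the question explicitly asks for a for loop, the same condition can also be built up in an ordinary loop instead of with reduce. A minimal sketch, equivalent to cond_expr above (the lit(False) seed is my assumption, not part of the original answer):

from pyspark.sql import functions as f

# Start from an always-false condition and OR in one test per column.
cond = f.lit(False)
for c in df1.columns:
    cond = cond | (f.col(c) == 0)

df1.withColumn('test', f.when(cond, 'Incomplete').otherwise('Complete')).show()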

Do you really name your columns with nothing but numbers? That is poor naming. With proper names (at least c1, c2, and so on) you could simply use F.expr to get the result (see the sketch after these comments).

While this works on a small example, it does not really scale: the rdd.map/lambda combination forces Spark to call the Python status() function row by row, losing the benefits of the optimized DataFrame execution.

@Amol You're welcome. If you like, you can also look into the other, better solutions.
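
A rough sketch of what that F.expr suggestion might look like, reusing the df with columns c1 to c4 from the first example (the exact expression is my assumption, not code from the original comments):

import pyspark.sql.functions as F

# With usable column names, the whole check fits in one SQL expression.
sql_cond = " OR ".join("{} = 0".format(c) for c in df.columns)
df.withColumn(
    'status',
    F.expr("CASE WHEN {} THEN 'Incomplete' ELSE 'Complete' END".format(sql_cond))
).show()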