How do I use a for loop with PySpark?


I am trying to check whether several column values are 0 inside a when/otherwise condition. We have a Spark DataFrame whose columns are named 1 through 11, and I need to check their values. My current code looks like this:

df3 = df3.withColumn('Status',
                     when((col("1") == 0) | (col("2") == 0) | (col("3") == 0) | (col("4") == 0) |
                          (col("5") == 0) | (col("6") == 0) | (col("7") == 0) | (col("8") == 0) |
                          (col("9") == 0) | (col("10") == 0) | (col("11") == 0),
                          'Incomplete').otherwise('Complete'))

How can I do this with a for loop instead of writing out so many conditions?

You can collect the conditions, join them into a single string, and then call eval on the result. There may be a better solution.
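The code for that eval step is not shown above, so here is a minimal sketch of what it might look like, reusing df3 and the column names "1" through "11" from the question (the details are an assumption, not the original answer's code):

from pyspark.sql.functions import col, when

# Build the OR of all "column == 0" tests as a string, then evaluate it
# into a single Column expression with eval().
cols = [str(i) for i in range(1, 12)]                 # "1" .. "11"
cond_str = " | ".join('(col("{}") == 0)'.format(c) for c in cols)
df3 = df3.withColumn('Status', when(eval(cond_str), 'Incomplete').otherwise('Complete'))

The example below takes a different route: it pulls the values out of each row with rdd.map and checks them with a plain Python status() function.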

>>> df = spark.createDataFrame([(1,0,0,2),(1,1,1,1)],['c1','c2','c3','c4'])
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  1|  0|  0|  2|
|  1|  1|  1|  1|
+---+---+---+---+

def status(x):
  # A Row is iterable, so collect its values and flag the row as
  # 'Incomplete' if any of them is 0.
  l = [i for i in x]
  if 0 in l:
    return 'Incomplete'
  else:
    return 'Complete'

>>> df.rdd.map(lambda x:  (x.c1, x.c2, x.c3, x.c4,status(x))).toDF(['c1','c2','c3','c4','status']).show()
+---+---+---+---+----------+
| c1| c2| c3| c4|    status|
+---+---+---+---+----------+
|  1|  0|  0|  2|Incomplete|
|  1|  1|  1|  1|  Complete|
+---+---+---+---+----------+
Here is a more Pythonic solution, using functools.reduce together with operator.or_:

import operator
import functools
from pyspark.sql import functions as f

colnames = [str(i+1) for i in range(11)]
df1 = spark.sparkContext.parallelize([
  [it for it in range(11)],
  [it for it in range(1, 12)]]
).toDF(colnames)

df1.show()
+---+---+---+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|
+---+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|
+---+---+---+---+---+---+---+---+---+---+---+

# OR together one "column == 0" test per column into a single condition.
cond_expr = functools.reduce(operator.or_, [(f.col(c) == 0) for c in df1.columns])

df1.withColumn('test', f.when(cond_expr, f.lit('Incomplete')).otherwise('Complete')).show()
+---+---+---+---+---+---+---+---+---+---+---+----------+
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|      test|
+---+---+---+---+---+---+---+---+---+---+---+----------+
|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|Incomplete|
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|  Complete|
+---+---+---+---+---+---+---+---+---+---+---+----------+


This way you do not need to define any function, evaluate string expressions, or use Python lambdas. Hope this helps.
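
Since the question explicitly asks for a for loop, the same condition can also be built up in an ordinary loop instead of with reduce. A minimal sketch, equivalent to cond_expr above (the lit(False) seed is my assumption, not part of the original answer):

from pyspark.sql import functions as f

# Start from an always-false condition and OR in one test per column.
cond = f.lit(False)
for c in df1.columns:
    cond = cond | (f.col(c) == 0)

df1.withColumn('test', f.when(cond, 'Incomplete').otherwise('Complete')).show()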

Do you really name your columns with nothing but numbers? That is poor naming. With proper names (at least c1, c2, and so on) you could simply use F.expr to get the result (see the sketch after these comments).

While this works on a small example, it does not really scale: the rdd.map/lambda combination forces Spark to call the Python status() function row by row, losing the benefits of the optimized DataFrame execution.

@Amol You're welcome. If you like, you can also look into the other, better solutions.
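
A rough sketch of what that F.expr suggestion might look like, reusing the df with columns c1 to c4 from the first example (the exact expression is my assumption, not code from the original comments):

import pyspark.sql.functions as F

# With usable column names, the whole check fits in one SQL expression.
sql_cond = " OR ".join("{} = 0".format(c) for c in df.columns)
df.withColumn(
    'status',
    F.expr("CASE WHEN {} THEN 'Incomplete' ELSE 'Complete' END".format(sql_cond))
).show()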