How do I use a for loop with PySpark?
I am trying to check whether multiple column values are 0 inside a when/otherwise condition. We have a Spark dataframe whose columns are named 1 through 11, and we need to check their values. My code currently looks like this:

df3 = df3.withColumn('Status',
    when((col("1") == 0) | (col("2") == 0) | (col("3") == 0) | (col("4") == 0)
         | (col("5") == 0) | (col("6") == 0) | (col("7") == 0) | (col("8") == 0)
         | (col("9") == 0) | (col("10") == 0) | (col("11") == 0), 'Incomplete')
    .otherwise('Complete'))
How can I achieve this with a for loop instead of writing out so many OR conditions?

One option is to collect the conditions, merge them into a single string, and call eval (a sketch of that idea appears after the example below). The code below instead checks each row with a plain Python function via rdd.map; there may be a better solution:
>>> df = spark.createDataFrame([(1,0,0,2),(1,1,1,1)],['c1','c2','c3','c4'])
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  1|  0|  0|  2|
|  1|  1|  1|  1|
+---+---+---+---+

# Return 'Incomplete' if any field in the row is 0, 'Complete' otherwise.
def status(x):
    l = [i for i in x]
    if 0 in l:
        return 'Incomplete'
    else:
        return 'Complete'

>>> df.rdd.map(lambda x: (x.c1, x.c2, x.c3, x.c4, status(x))).toDF(['c1','c2','c3','c4','status']).show()
+---+---+---+---+----------+
| c1| c2| c3| c4|    status|
+---+---+---+---+----------+
|  1|  0|  0|  2|Incomplete|
|  1|  1|  1|  1|  Complete|
+---+---+---+---+----------+
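For completeness, here is a minimal sketch of the "merge the conditions into a single string and call eval" idea mentioned at the top of this answer. It assumes the same df as above; eval on a generated string is fragile, so treat it as illustration only:

from pyspark.sql.functions import col, when

# Build '(col("c1") == 0)|(col("c2") == 0)|...' as one string, then eval it
# back into a Column expression. Hypothetical sketch; prefer the reduce
# approach shown in the next answer.
cond_str = "|".join(f'(col("{c}") == 0)' for c in df.columns)
df.withColumn('status', when(eval(cond_str), 'Incomplete').otherwise('Complete')).show()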
I would suggest a more Pythonic solution, using functools.reduce and operator.or_:
import functools
import operator

from pyspark.sql import functions as f

# Columns named '1' through '11', matching the question.
colnames = [str(i + 1) for i in range(11)]
df1 = spark.sparkContext.parallelize([
    [it for it in range(11)],      # first row contains a 0
    [it for it in range(1, 12)],   # second row does not
]).toDF(colnames)
df1.show()
+---+---+---+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|
+---+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|
+---+---+---+---+---+---+---+---+---+---+---+

# reduce(operator.or_, [a, b, c]) folds the list into a | b | c,
# i.e. one big OR over all the per-column "== 0" checks.
cond_expr = functools.reduce(operator.or_, [(f.col(c) == 0) for c in df1.columns])
df1.withColumn('test', f.when(cond_expr, f.lit('Incomplete')).otherwise('Complete')).show()
+---+---+---+---+---+---+---+---+---+---+---+----------+
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|      test|
+---+---+---+---+---+---+---+---+---+---+---+----------+
|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|Incomplete|
|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11|  Complete|
+---+---+---+---+---+---+---+---+---+---+---+----------+
This way you do not need to define any function, evaluate a string expression, or use a Python lambda. Hope this helps.

Do you really name your columns with bare numbers? That is poor naming. With proper names (at least c1, c2, and so on) you could simply use F.expr to get the result; a sketch follows below.

While the rdd.map approach works on a small example, it does not really scale: combining rdd.map with a lambda forces every row to be handed off to a Python worker just to call status(), losing the performance of native Spark SQL expressions.

@Amol you're welcome. Feel free to look into the other, better solutions as well.
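A minimal sketch of the F.expr suggestion from the comment above, assuming the columns have been renamed c1 through c11 (the names and the CASE expression here are illustrative, not from the original answers):

from pyspark.sql import functions as F

# Hypothetical renamed columns c1..c11; build one SQL CASE expression.
cols = [f'c{i}' for i in range(1, 12)]
cond = ' OR '.join(f'{c} = 0' for c in cols)
df3 = df3.withColumn(
    'Status',
    F.expr(f"CASE WHEN {cond} THEN 'Incomplete' ELSE 'Complete' END"),
)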