Updating an existing column in PySpark without changing the old values


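I am trying to update an existing column in PySpark, but the values already in that column are being changed as well, even though I have not specified an otherwise condition.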

+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
|   MY|    A|     |    1|    2|null|
|   MY|    G|     |    1|    2|null|
|   MY|     |    G|    1|    2|null|
|   TH|    A|     |   16|    2|null|
|   TH|    B|     |    1|   16|   1|
|   TH|     |    W|   16|    2|   1|
+-----+-----+-----+-----+-----+----+
You need to combine the conditions with a boolean OR, like this:

from pyspark.sql import functions as F

df = sc.parallelize([["MY", "A", "", "1", "2"], ["MY", "G", "", "1", "2"], ["MY", "", "G", "1", "2"], ["TH", "A", "", "16", "2"], ["TH", "B", "", "1", "16"], ["TH", "", "W", "16", "2"]]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))
cond1 = (df.cntry == "MY") & (df.cde_1.isin("G") | df.cde_2.isin("G")) & ((df.rsn_1 == "1") | (df.rsn_2 == "1"))
cond2 = (df.cntry == "TH") & (df.cde_1.isin("B", "W") | df.cde_2.isin("B", "W")) & ((df.rsn_1 == "16") | (df.rsn_2 == "16"))
# A row gets FLAG = 1 if either condition holds, and null otherwise
df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()
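In the last line of your code you overwrite the FLAG column, because you do not reference its previous state: F.when without an otherwise clause returns null for every row that does not match the condition. That is why the earlier values are not preserved.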

Instead of combining the expressions, you can also use when(cond1, 1).otherwise(when(cond2, 1)). This is a stylistic choice.

# Code from the question
df = sc.parallelize([["MY", "A", "", "1", "2"], ["MY", "G", "", "1", "2"], ["MY", "", "G", "1", "2"], ["TH", "A", "", "16", "2"], ["TH", "B", "", "1", "16"], ["TH", "", "W", "16", "2"]]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

# First update: flag the matching MY rows
df = df.withColumn('FLAG', F.when((df.cntry == "MY") & (df.cde_1.isin("G") | df.cde_2.isin("G")) & ((df.rsn_1 == "1") | (df.rsn_2 == "1")), 1))

# Second update: replaces FLAG entirely, so the MY rows set to 1 above become null again
df = df.withColumn('FLAG', F.when((df.cntry == "TH") & (df.cde_1.isin("B", "W") | df.cde_2.isin("B", "W")) & ((df.rsn_1 == "16") | (df.rsn_2 == "16")), 1))