Apache Spark: Updating the values of an array in a PySpark DataFrame

apache-spark pyspark

I want to check whether the last two values of an array in a PySpark DataFrame are [1, 0] and, if they are, update them to [1, 1].

Input DataFrame

Output DataFrame

You can slice the array, apply a case expression to the last two elements, and combine the two slices with concat.
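To make the slice / case / concat idea concrete, here is a minimal plain-Python sketch of the same logic, with no Spark involved. The function name `replace_trailing` and the sample rows are hypothetical and chosen only for illustration; the original input DataFrame is not shown on this page.

```python
def replace_trailing(values):
    """Return a copy of `values` with a trailing [1, 0] replaced by [1, 1]."""
    # Slice 1: everything except the last two elements.
    head = values[:-2]
    # Slice 2: the last two elements (shorter if the list has fewer than two).
    tail = values[-2:]
    # Case analysis on the tail only.
    if tail == [1, 0]:
        tail = [1, 1]
    # Concatenate the two slices back together.
    return head + tail


# Hypothetical sample rows, not taken from the original question.
rows = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print([replace_trailing(r) for r in rows])
```

Only the first sample row ends in [1, 0], so it is the only one that changes. The Spark SQL functions `slice`, `case when` and `concat` play the same three roles on an array column.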

You can combine array functions with a when expression:

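from pyspark.sql import functions as F

df1 = df.withColumn(
    "Array_column",
    F.when(
        # Do the last two elements equal [1, 0]?
        F.slice("Array_column", -2, 2) == F.array(F.lit(1), F.lit(0)),
        # If so, keep everything but the last two elements and append [1, 1].
        F.flatten(
            F.array(
                F.expr("slice(Array_column, 1, size(Array_column) - 2)"),
                F.array(F.lit(1), F.lit(1)),
            )
        ),
    ).otherwise(F.col("Array_column"))  # Otherwise leave the array unchanged.
)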

df1.show()

#+-------+------------+
#|Column1|Array_column|
#+-------+------------+
#|    abc|[0, 1, 1, 1]|
#|    def|[1, 1, 0, 0]|
#|    adf|[0, 0, 1, 1]|
#+-------+------------+

# Alternative: slice the array, apply CASE WHEN to the last two elements,
# and concat the two slices back together.
import pyspark.sql.functions as F

df2 = df.withColumn(
    'Array_column',
    F.expr("""
        concat(
            slice(Array_column, 1, size(Array_column) - 2),
            case when slice(Array_column, size(Array_column) - 1, 2) = array(1,0)
                 then array(1,1)
                 else slice(Array_column, size(Array_column) - 1, 2)
            end
         )
    """)
)

df2.show()

#+-------+------------+
#|Column1|Array_column|
#+-------+------------+
#|    abc|[0, 1, 1, 1]|
#|    def|[1, 1, 0, 0]|
#|    adf|[0, 0, 1, 1]|
#+-------+------------+

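# Alternative: a Python UDF (usually slower than the built-in functions above).
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def update_last_two(arr):
    # Negative indices handle arrays of any length, not only length 4.
    if arr is not None and len(arr) >= 2 and arr[-2] == 1 and arr[-1] == 0:
        return arr[:-1] + [1]
    return arr

# Declare the return type; without it the UDF returns a string.
# Adjust the element type if your column is not an array of integers.
update_last_two_udf = F.udf(update_last_two, ArrayType(IntegerType()))

df.withColumn("Array_column", update_last_two_udf(F.col("Array_column"))).show()

#+-------+------------+
#|Column1|Array_column|
#+-------+------------+
#|    abc|[0, 1, 1, 1]|
#|    def|[1, 1, 0, 0]|
#|    adf|[0, 0, 1, 1]|
#+-------+------------+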