PySpark ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

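I get this error when I use the code below to drop a nested column in PySpark. Why does it not work? I tried using a tilde (~) instead of != as the error message suggests, but that did not work either. What would you do in this case?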

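from pyspark.sql.functions import struct

def drop_col(df, struct_nm, delete_struct_child_col_nm):
    # Names of the struct's child fields, minus the one to delete
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm, df.select("{}.*".format(struct_nm)).columns)
    # Qualify each remaining field as "struct.field"
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    # Rebuild the struct column from the remaining fields
    return df.withColumn(struct_nm, struct(fields_to_keep))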
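
For context on the message itself: PySpark raises this ValueError whenever a Column expression is evaluated for truthiness, for example by Python's if, and, not, or by the built-in filter. The sketch below does not use PySpark. FakeColumn is a hypothetical stand-in that mimics only that one behaviour. It shows one way the error could be reached from code like the above, namely if the name being compared were a Column-like object rather than a plain string. Whether that is what happened in the question is an assumption, not something the thread confirms.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column (not part of PySpark)."""

    def __init__(self, expr):
        self.expr = expr

    def __ne__(self, other):
        # Like a Column, comparing builds a new expression instead of a bool.
        return FakeColumn("({} != {})".format(self.expr, other))

    def __bool__(self):
        # Like a Column, refuse to be used where Python needs True/False.
        raise ValueError("Cannot convert column into bool")


field_names = ["state", "street", "country", "code"]

# Plain strings: != returns a real bool, so filter works as intended.
kept = list(filter(lambda x: x != "code", field_names))
print(kept)

# Column-like object: != returns an expression, and filter() then asks for
# its truth value, which raises.
try:
    list(filter(lambda x: x != FakeColumn("code"), field_names))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

If this is the situation, passing the field name as a plain string keeps the comparison inside ordinary Python, where != behaves as expected.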

I built a simple example with a struct column and a few dummy columns:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# One struct column, 'addresses', with four nested fields
schema = StructType(
    [
        StructField('addresses',
                    StructType(
                        [StructField("state", StringType(), True),
                         StructField("street", StringType(), True),
                         StructField("country", StringType(), True),
                         StructField("code", IntegerType(), True)]
                    )
                    )
    ]
)

# Each row is a one-element tuple holding the struct as a dict
rows = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
        ({'state': 'ca', 'street': 'baker',  'country': 'USA', 'code': 101},)]

df = spark.createDataFrame(rows, schema)
# Two dummy top-level columns
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))

# show() and printSchema() print directly, so no print() wrapper is needed
df.show()
df.printSchema()
Output:

+--------------------+-----------+----+
|           addresses|         id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+

root
 |-- addresses: struct (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- code: integer (nullable = true)
 |-- id: long (nullable = false)
 |-- name: string (nullable = false)
To drop the entire struct column, simply use the drop function:

df2 = df.drop('addresses')
df2.show()
Output:

+-----------+----+
|         id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
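Dropping specific fields inside a struct column is a bit more involved, and there are other similar questions about it.

In any case, I found those answers somewhat complicated. My approach is simply to reassign the original column, using only the subset of struct fields that you want to keep: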

columns_to_keep = ['country', 'code']

df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:

+----------+-----------+----+
| addresses|         id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you would rather specify the columns to remove instead of the columns to keep, do the following:

columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:

+------------+-----------+----+
|   addresses|         id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
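
One detail of the set-based version: Python sets are unordered, so the remaining fields may come out in a different order than in the original struct. A minimal pure-Python sketch of an order-preserving alternative follows. qualified_fields_to_keep is a hypothetical helper name, and all_columns is hard-coded here as a stand-in for df.select("addresses.*").columns.

```python
def qualified_fields_to_keep(struct_name, all_columns, columns_to_remove):
    """Return 'struct.field' names for every field not removed, in original order."""
    removed = set(columns_to_remove)  # a set only for fast membership tests
    return ["{}.{}".format(struct_name, c) for c in all_columns if c not in removed]


# Stand-in for df.select("addresses.*").columns (assumed field order).
all_columns = ["state", "street", "country", "code"]

fields = qualified_fields_to_keep("addresses", all_columns, ["country", "code"])
print(fields)  # ['addresses.state', 'addresses.street']
```

The resulting list could then be unpacked into struct(*fields) in the same withColumn call used in the answer.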

Hope this helps.

Thanks for the quick answer. Actually, this does not quite work for me. On this line I get an "invalid syntax" error:
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))

What is the syntax error? If you are using a Python version below 3.6, you may not have f-strings, so you need another way of formatting the string, for example:
["addresses." + column for column in columns_to_keep]

As a side note, I realised that the conversion to a list on this line
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
is unnecessary. You can leave it as a set, because the next line only iterates over it and there will be no duplicates:
columns_to_keep = set(all_columns) - set(columns_to_remove)

Thanks. In fact, I fixed it by replacing that line with the one from the code quoted in my question:
fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
Your current solution looks simpler, though. I am not sure, but it may be my Python version that causes the error.

Yes, that solves it too. It is only a matter of changing how the formatted string is built, which can be done with "+" or with the "format" function. From Python 3.6 onwards you can also format strings the way I did, with what are called f-strings. Glad it all worked out!
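
The three ways of building the qualified field names mentioned in these comments produce identical strings. A small sketch in plain Python follows; the f-string line needs Python 3.6 or later, and the variable names are only illustrative.

```python
columns_to_keep = ["country", "code"]

# 1. Concatenation with +, works on any Python version.
with_plus = ["addresses." + column for column in columns_to_keep]

# 2. str.format, also available before Python 3.6.
with_format = ["addresses.{}".format(column) for column in columns_to_keep]

# 3. f-strings, Python 3.6+ only; older interpreters report a syntax error here.
with_fstring = [f"addresses.{column}" for column in columns_to_keep]

print(with_plus == with_format == with_fstring)
print(with_plus)
```

Because the f-string is rejected at parse time on older interpreters, the whole file fails with "invalid syntax" before any Spark code runs, which matches the symptom described above.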