Apache Spark: problem filtering a PySpark DataFrame when values contain ">" or "<"
Tags: apache-spark, pyspark, apache-spark-sql, pyspark-dataframes

My DataFrame has a `value` column whose values can contain ">" or "<". I split it into the rows whose `value` contains one of those characters (`df1`) and the rows whose `value` does not (`df2`), but the two counts do not add up to `df.count()`.

Answer: This is almost certainly caused by null values in the `value` column. When you use `contains` in a filter, null values are skipped: `contains` evaluates to null (not false) for a null input, and `filter` keeps only rows where the predicate is true, so a null row is dropped by both the condition and its negation. Example:
from pyspark.sql.functions import col

data = [("value1_>",), ("value2_>",), ("value3_<",), ("value4",), (None,)]
df = spark.createDataFrame(data, ['value'])

df1 = df.filter(col("value").contains('>') | col("value").contains('<'))
df2 = df.filter(~(col("value").contains('>') | col("value").contains('<')))

print(df.count())   # 5
print(df1.count())  # 3
print(df2.count())  # 1  <- the None row lands in neither df1 nor df2
The same thing happens with the question's real data: `df.count()` returned 3,900,000, while `df1.count()` was 202 and `df2.count()` was 3,600,000. The expectation was

df.count() == df1.count() + df2.count()

but roughly 300,000 rows appear in neither `df1` nor `df2`: those are the rows where `value` is null.