How do I match null values with MongoDB's Spark connector?

I'm trying to query a MongoDB collection with the aggregation framework through the PySpark MongoDB connector, but I can't get a match against null to work.

I have already tried the following in the pipeline:

{'$match' : {'deleted_at': null}}
{'$match' : {'deleted_at': 'null'}}
{'$match' : {'deleted_at': None}}
{'$match' : {'deleted_at': False}}
{'$match' : {'deleted_at': 0}}

But nothing seems to work. Any ideas?
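For reference, a minimal sketch of how such a pipeline is passed to the connector (assumptions: connector 2.x, a local URI, and a test.coll collection; the "pipeline" read option is the slot where each $match variant above was tried):

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

# The connector accepts an aggregation pipeline as a JSON string through
# the "pipeline" read option; this is one of the variants tried above.
pipeline = '[{"$match": {"deleted_at": null}}]'

df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("pipeline", pipeline) \
    .load()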

I found a possible solution that avoids changing all the queries: match on the field's BSON type:

{'$match': {'deleted_at': {'$type': 10}}}
since 10 corresponds to the BSON null type.
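Passed through the connector, the type-based match looks like this (a sketch reusing the session and read format from the snippet above):

# BSON type 10 is null: this matches documents where deleted_at exists
# and holds an explicit null, though not documents missing the field.
pipeline = '[{"$match": {"deleted_at": {"$type": 10}}}]'

df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("pipeline", pipeline) \
    .load()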

You can take advantage of Spark's pushdown filters (enabled by default). When you use filters with DataFrames or the Python API, the underlying Mongo connector code constructs an aggregation pipeline that filters the data in MongoDB before it is sent to Spark.

Have you tried df.filter($"deleted_at" == null) with Spark SQL? You can then check on the MongoDB side to confirm that Spark is building the MongoDB aggregation pipeline from the filter.

Python code:
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# isNull() matches actual null values; comparing against the string 'null'
# would only match documents whose deleted_at is that literal string.
filtrDf = df.filter(df['deleted_at'].isNull())

filtrDf.explain()  # check the physical plan of this query
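If the pushdown is working, the physical plan printed by explain() should list the condition among the scan's pushed filters (typically something like PushedFilters: [IsNull(deleted_at)]), confirming that the connector translates it into a MongoDB $match stage rather than filtering after the data reaches Spark.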