Apache Spark: unclear analysis error when converting a complex PSQL query to Spark SQL

Tags: apache-spark, pyspark, apache-spark-sql, pyspark-sql

I have been doing some analysis on a database and recently switched to Spark, because the CSV is larger than 100 GB, which is too big for a single machine.

Most of my queries run fine; however, the following one appears to have a problem:

psql = """
select b.*,
       (select count(distinct c.notice_sender)
        from lumen_sender_duplicate_utility c
        where c.domain_name = b.domain_name
          and cast(c.num_of_dup_urls as int) = 0) num_of_distinct_senders
from (select a.domain_name,
             sum(a.num_of_url) total_num_urls,
             sum(a.num_of_dup_urls) total_num_dup_urls,
             count(distinct a.notice_sender) total_num_senders
      from lumen_sender_duplicate a
      group by a.domain_name) b
"""
I got a variety of errors as I changed it around, but the most recent one is shown below (the full stack trace is available at ).

At first I thought this was because some feature such as DISTINCT or subqueries was unavailable, but I am using Spark 2.4, so everything should be supported. (I also tested each component separately, and there seemed to be no problem.) If anyone knows where I went wrong, any help would be appreciated.

Caused by: java.lang.RuntimeException: Couldn't find count(DISTINCT notice_sender)#419L in [domain_name#13,sum(cast(num_of_url#14 as double))#415,sum(cast(num_of_dup_urls#16 as double))#416,count(notice_sender#10)#417L]
     at scala.sys.package$.error(package.scala:27)