TimeoutException: Spark SQL query

Tags: apache-spark, pyspark, apache-spark-sql

I'm struggling with this Spark SQL query, which uses a simple join plus some logical conditions.

I can get output from it on a relatively small dataset, but things change with larger ones: I want the join to run with about 14 million rows in A and 1 million rows in B.

I'm using an EMR cluster of 10 r4.4xlarge instances.

These are the configuration parameters I pass to the job:

spark.driver.memory 100g
spark.executor.cores    5
spark.executor.memory   39g
The SparkSession is created with the following parameters:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

sq = SparkSession.builder.config('spark.rpc.message.maxSize', '1536')\
    .config("spark.sql.shuffle.partitions", 490)\
    .config("spark.sql.broadcastTimeout", 2000)\
    .config("spark.sql.autoBroadcastJoinThreshold", 1024*1024*900)\
    .getOrCreate()
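One note on the settings above: spark.sql.broadcastTimeout is specified in seconds (the default is 300), which is why the first stack trace further down reports "Futures timed out after [2000 seconds]". It can also be raised on an existing session; the value below is purely illustrative, and raising it only postpones the failure if the broadcast itself can never complete:

sq.conf.set("spark.sql.broadcastTimeout", 36000)  # seconds; illustrative value only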
Datasets A and B are obtained through a workflow, but it is only at this query that the process dies:

sql_1 = """
        SELECT
          A.userid,
          A.eventtime,
          A.latitude as userid_latitude,
          A.longitude as userid_longitude,
          A.events,
          B.unique_reference_number,
          B.name as poi_name,
          B.pointx_classification_name as poi_classification_name,
          B.brand,
          B.lat as poi_latitude,
          B.long as poi_longitude,
          acos(sin(pi()*A.latitude/180.0)*sin(pi()*B.lat/180.0)+cos(pi()*A.latitude/180.0)*cos(pi()*B.lat/180.0)*cos(pi()*B.long/180.0-pi()*A.longitude/180.0))*6371 as distance,
          B.poi_radious_meters/1000 as poi_radious_km,
        CASE WHEN (acos(sin(pi()*A.latitude/180.0)*sin(pi()*B.lat/180.0)+cos(pi()*A.latitude/180.0)*cos(pi()*B.lat/180.0)*cos(pi()*B.long/180.0-pi()*A.longitude/180.0))*6371) <= B.poi_radious_meters/1000 THEN 1 ELSE 0 END as is_within_radius
        FROM A
        LEFT JOIN B ON array_contains(B.grid_array, A.grid_id)
        WHERE (acos(sin(pi()*A.latitude/180.0)*sin(pi()*B.lat/180.0)+cos(pi()*A.latitude/180.0)*cos(pi()*B.lat/180.0)*cos(pi()*B.long/180.0-pi()*A.longitude/180.0))*6371) <= 0.6
        """ 
interim = sq.sql(sql_1)

# Aggregate the events
output = interim.groupBy("userid", "eventtime", "unique_reference_number").agg((F.sum('events')).alias("events"))
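As a sanity check on the math: the distance expression in sql_1 is the spherical law of cosines with an Earth radius of 6371 km. A minimal standalone Python version of the same formula (the coordinates in the example are hypothetical, not from the post):

import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Spherical law of cosines, mirroring the SQL expression in sql_1."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = min(1.0, max(-1.0, math.sin(p1) * math.sin(p2)
                           + math.cos(p1) * math.cos(p2) * math.cos(dlon)))
    return math.acos(c) * radius_km

# Two points roughly 0.57 km apart, i.e. within the WHERE distance <= 0.6 filter.
print(great_circle_km(51.5007, -0.1246, 51.5055, -0.1218))

The clamp matters in practice: without it, identical coordinates can yield an argument fractionally above 1 and acos returns NaN, which the raw SQL expression is also exposed to.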
Comments:

Why use spark.sql.autoBroadcastJoinThreshold at all? It looks like the data being broadcast is too big, the network not fast enough, and you hit the timeout --> "BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)".

Before, I didn't have it, but the same error appeared. I did some quick research and decided to add it to see whether things would change, but sadly they didn't.

Can you remove all the other "stuff" you think isn't helping and start over? I'm curious what's causing this and may be able to help (but nothing tells me those settings are required or similar).

OK, I'll run it again without those settings and let you know in a few minutes.

I've added the error log to the question for the case where I don't use the broadcast parameter:
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
    at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doExecute(BroadcastNestedLoopJoinExec.scala:343)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
    at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
    at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:96)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:85)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:41)
    at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:98)
    at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
    at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
    at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
    at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [2000 seconds]
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
    at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doExecute(BroadcastNestedLoopJoinExec.scala:343)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
    at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
    at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:96)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:85)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:41)
    at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:98)
    at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
    at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
    at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
    at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job 14 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:809)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:807)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:807)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1738)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1825)
    at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1770)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:78)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:75)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:74)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:74)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
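One detail worth noting in the traces above: the failing operator is BroadcastNestedLoopJoinExec. Because array_contains(B.grid_array, A.grid_id) is not an equi-join condition, Spark cannot use a hash or sort-merge join and falls back to broadcasting B for a nested-loop join (interim.explain() would show the same plan up front). A common workaround, sketched here under the assumption that the column names match sql_1 above, is to flatten B.grid_array with LATERAL VIEW explode so the join becomes an equi-join on grid_id and no broadcast is required:

# Sketch (not from the original post): explode B.grid_array so the planner
# can choose a shuffle-based equi-join instead of broadcasting B.
sql_equi = """
    SELECT
      A.userid,
      A.eventtime,
      B2.unique_reference_number,
      acos(sin(pi()*A.latitude/180.0)*sin(pi()*B2.lat/180.0)
         + cos(pi()*A.latitude/180.0)*cos(pi()*B2.lat/180.0)
         * cos(pi()*B2.long/180.0 - pi()*A.longitude/180.0))*6371 as distance
    FROM A
    JOIN (
      SELECT B.*, gid as grid_id
      FROM B
      LATERAL VIEW explode(grid_array) g AS gid
    ) B2
      ON A.grid_id = B2.grid_id
    """
interim_equi = sq.sql(sql_equi)

Since the WHERE filter on distance in sql_1 references B's columns, the original LEFT JOIN already behaves as an inner join, so an inner equi-join plus the same WHERE clause should return the same rows while avoiding the broadcast entirely.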