TimeoutException: Spark SQL query

Tags: apache-spark, pyspark, apache-spark-sql

I'm struggling with this Spark SQL query, which uses a simple join plus some logical conditions.

I can get output from it on a relatively small dataset, but things change with larger ones: I want the join to run with about 14 million rows in A and 1 million rows in B.

I'm using an EMR cluster of 10 r4.4xlarge instances.

These are the configuration parameters I pass to the job:

spark.driver.memory 100g
spark.executor.cores    5
spark.executor.memory   39g
The SparkSession is created with the following parameters:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

sq = SparkSession.builder.config('spark.rpc.message.maxSize', '1536')\
    .config("spark.sql.shuffle.partitions", 490)\
    .config("spark.sql.broadcastTimeout", 2000)\
    .config("spark.sql.autoBroadcastJoinThreshold", 1024*1024*900)\
    .getOrCreate()
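One note on the settings above: spark.sql.broadcastTimeout is specified in seconds (the default is 300), which is why the first stack trace further down reports "Futures timed out after [2000 seconds]". It can also be raised on an existing session; the value below is purely illustrative, and raising it only postpones the failure if the broadcast itself can never complete:

sq.conf.set("spark.sql.broadcastTimeout", 36000)  # seconds; illustrative value only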
Datasets A and B are obtained through a workflow, but it is only at this query that the process dies:

sql_1 = """
        SELECT
          A.userid,
          A.eventtime,
          A.latitude as userid_latitude,
          A.longitude as userid_longitude,
          A.events,
          B.unique_reference_number,
          B.name as poi_name,
          B.pointx_classification_name as poi_classification_name,
          B.brand,
          B.lat as poi_latitude,
          B.long as poi_longitude,
          acos(sin(pi()*A.latitude/180.0)*sin(pi()*B.lat/180.0)+cos(pi()*A.latitude/180.0)*cos(pi()*B.lat/180.0)*cos(pi()*B.long/180.0-pi()*A.longitude/180.0))*6371 as distance,
          B.poi_radious_meters/1000 as poi_radious_km,
        CASE WHEN (acos(sin(pi()*A.latitude/180.0)*sin(pi()*B.lat/180.0)+cos(pi()*A.latitude/180.0)*cos(pi()*B.lat/180.0)*cos(pi()*B.long/180.0-pi()*A.longitude/180.0))*6371) <= B.poi_radious_meters/1000 THEN 1 ELSE 0 END as is_within_radius
        FROM A
        LEFT JOIN B ON array_contains(B.grid_array, A.grid_id)
        WHERE (acos(sin(pi()*A.latitude/180.0)*sin(pi()*B.lat/180.0)+cos(pi()*A.latitude/180.0)*cos(pi()*B.lat/180.0)*cos(pi()*B.long/180.0-pi()*A.longitude/180.0))*6371) <= 0.6
        """ 
interim = sq.sql(sql_1)

# Aggregate the events
output = interim.groupBy("userid", "eventtime", "unique_reference_number").agg((F.sum('events')).alias("events"))
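As a sanity check on the math: the distance expression in sql_1 is the spherical law of cosines with an Earth radius of 6371 km. A minimal standalone Python version of the same formula (the coordinates in the example are hypothetical, not from the post):

import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Spherical law of cosines, mirroring the SQL expression in sql_1."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = min(1.0, max(-1.0, math.sin(p1) * math.sin(p2)
                           + math.cos(p1) * math.cos(p2) * math.cos(dlon)))
    return math.acos(c) * radius_km

# Two points roughly 0.57 km apart, i.e. within the WHERE distance <= 0.6 filter.
print(great_circle_km(51.5007, -0.1246, 51.5055, -0.1218))

The clamp matters in practice: without it, identical coordinates can yield an argument fractionally above 1 and acos returns NaN, which the raw SQL expression is also exposed to.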
Comments:

Why use spark.sql.autoBroadcastJoinThreshold at all? It looks like the data being broadcast is too big, the network not fast enough, and you hit the timeout --> "BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)".

Before, I didn't have it, but the same error appeared. I did some quick research and decided to add it to see whether things would change, but sadly they didn't.

Can you remove all the other "stuff" you think isn't helping and start over? I'm curious what's causing this and may be able to help (but nothing tells me those settings are required or similar).

OK, I'll run it again without those settings and let you know in a few minutes.

I've added the error log to the question for the case where I don't use the broadcast parameter:
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
    at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doExecute(BroadcastNestedLoopJoinExec.scala:343)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
    at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
    at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:96)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:85)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:41)
    at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:98)
    at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
    at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
    at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
    at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [2000 seconds]
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
    at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doExecute(BroadcastNestedLoopJoinExec.scala:343)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
    at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
    at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:96)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:85)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:41)
    at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:98)
    at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
    at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
    at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
    at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job 14 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:809)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:807)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:807)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1738)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1825)
    at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1770)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:78)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:75)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:74)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:74)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
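One detail worth noting in the traces above: the failing operator is BroadcastNestedLoopJoinExec. Because array_contains(B.grid_array, A.grid_id) is not an equi-join condition, Spark cannot use a hash or sort-merge join and falls back to broadcasting B for a nested-loop join (interim.explain() would show the same plan up front). A common workaround, sketched here under the assumption that the column names match sql_1 above, is to flatten B.grid_array with LATERAL VIEW explode so the join becomes an equi-join on grid_id and no broadcast is required:

# Sketch (not from the original post): explode B.grid_array so the planner
# can choose a shuffle-based equi-join instead of broadcasting B.
sql_equi = """
    SELECT
      A.userid,
      A.eventtime,
      B2.unique_reference_number,
      acos(sin(pi()*A.latitude/180.0)*sin(pi()*B2.lat/180.0)
         + cos(pi()*A.latitude/180.0)*cos(pi()*B2.lat/180.0)
         * cos(pi()*B2.long/180.0 - pi()*A.longitude/180.0))*6371 as distance
    FROM A
    JOIN (
      SELECT B.*, gid as grid_id
      FROM B
      LATERAL VIEW explode(grid_array) g AS gid
    ) B2
      ON A.grid_id = B2.grid_id
    """
interim_equi = sq.sql(sql_equi)

Since the WHERE filter on distance in sql_1 references B's columns, the original LEFT JOIN already behaves as an inner join, so an inner equi-join plus the same WHERE clause should return the same rows while avoiding the broadcast entirely.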