
Apache Spark: how to distribute the job to avoid running out of memory

Tags: apache-spark, pyspark, apache-spark-sql, azure-hdinsight, pyspark-sql

I'm trying to run some Spark jobs, but the executors frequently run out of memory:

17/02/06 19:12:02 WARN TaskSetManager: Lost task 10.0 in stage 476.3 (TID 133250, 10.0.0.10): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1486378087852_0006_01_000019 on host: 10.0.0.10. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1486378087852_0006_01_000019
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
Since I've already set spark.executor.memory=20480m, I don't think the job actually needs more RAM to work, so the other option I see is to increase the number of partitions.

I tried:

>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
>>> sqlContext.getConf("spark.sql.shuffle.partitions")
u'2001'

However, when I kick off the job, I still see the default 200 partitions:

>>> all_users.repartition(2001).show()
[Stage 526:(0 + 30) / 200][Stage 527:>(0 + 0) / 126][Stage 528:>(0 + 0) / 128]0]
I'm using PySpark 2.0.2 on Azure HDInsight. Can anyone point out what I'm doing wrong?

Edit

As per the answer below, I tried:

sqlContext.setConf('spark.sql.shuffle.partitions', 2001)
beforehand, but it didn't work. However, this did work:

sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)
all_users is a SQL DataFrame. Concretely, it is built like this:

all_users = sqlContext.table('RoamPositions')\
    .withColumn('prev_district_id', F.lag('district_id', 1).over(user_window))\
    .withColumn('prev_district_name', F.lag('district_name', 1).over(user_window))\
    .filter('prev_district_id IS NOT NULL AND prev_district_id != district_id')\
    .select('timetag', 'imsi', 'prev_district_id', 'prev_district_name', 'district_id', 'district_name')

Based on your comments, it looks like you read the data from an external source and apply a window function before calling repartition. A window function (both cases are sketched right after this list):

  • repartitions the data into a single partition if no partitionBy clause is provided
  • uses the standard shuffle mechanism if a partitionBy clause is provided
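For reference, here is a minimal sketch of both cases. user_window itself is not shown in the question, so the column choices (imsi, timetag) below are only assumptions based on the select in your snippet:

from pyspark.sql import Window

# Hypothetical reconstruction of user_window: with a partitionBy clause,
# lag(...).over(user_window) shuffles the data into
# spark.sql.shuffle.partitions partitions.
user_window = Window.partitionBy('imsi').orderBy('timetag')

# Without partitionBy, the whole dataset is moved into a single partition,
# which is an easy way to run one executor out of memory.
global_window = Window.orderBy('timetag')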
The latter seems to be the case here. Since the default value of spark.sql.shuffle.partitions is 200, your data is shuffled into 200 partitions before repartition is applied. If you want 2001 all the way through, you should set it before loading the data:

sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
all_users = ...
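To confirm the new value is actually used, rebuild all_users after the setConf call and inspect the partition count of the shuffled result. A sketch, assuming user_window has a partitionBy clause so the window output carries spark.sql.shuffle.partitions partitions:

>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
>>> # ... rebuild all_users exactly as above ...
>>> all_users.rdd.getNumPartitions()  # should now report 2001 instead of 200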
Also, spark.sql.shuffle.partitions does not affect the number of initial partitions; those are controlled by other properties, such as the spark.sql.files.maxPartitionBytes setting from your edit (see the sketch below).

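For file-based sources, spark.sql.files.maxPartitionBytes caps how many bytes are packed into a single input partition, so a lower value yields more, smaller initial partitions. A minimal sketch, mirroring the setting that worked in your edit and assuming RoamPositions is backed by a file-based source:

sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)  # ~100 MB per input split
all_users = sqlContext.table('RoamPositions')  # now read into more, smaller initial partitions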