Apache Spark: out-of-memory error when collecting data from a Spark cluster


apache-spark memory


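I know there are many questions about out-of-memory errors on Spark, but I haven't found a solution to mine.

I have a simple workflow:

1. Read ORC files from Amazon S3.
2. filter down to a small subset of rows.
3. select a small subset of columns.
4. collect into the driver node (so I can do additional operations in R).

When I run the above and then cache the table in Spark memory, it takes up [...]

When you call collect on a DataFrame, two things happen:

1. First, all the data has to be written to the output on the driver.
2. The driver has to gather the data from all the nodes and keep it in memory.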
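Answer:

If you only want to load the data into the executors' memory, count() is also an action that loads the data into executor memory, where other processes can use it.

If you want to extract the data, try setting this, along with your other properties, when pulling the data: "--conf spark.driver.maxResultSize=10g".

As mentioned above, "cache" is not an action; check the Spark documentation on RDD persistence.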
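But "collect" is an action, and all computations (including "cache") start when "collect" is called.

You are running the application in local mode (master='local'), which means the initial data load and all computations run in the same memory.

Most of the memory is being used by loading the data and by the other computations, not by "collect".

You can check this by replacing "collect" with "count".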
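The check suggested above (run a count first, then collect) can be sketched as a small helper. This is only an illustration: `collect_if_small`, `max_rows`, `count_rows`, and `pull` are hypothetical names that do not come from the question, and the defaults assume sparklyr's `sdf_nrow()` and dplyr's `collect()` are available.

```r
# Sketch: count rows on the cluster first, and only collect when the result is small.
# `collect_if_small`, `max_rows`, `count_rows`, and `pull` are hypothetical names.
collect_if_small <- function(tbl, max_rows = 1e6,
                             count_rows = sparklyr::sdf_nrow,
                             pull = dplyr::collect) {
  n <- count_rows(tbl)  # an action: runs the plan, returns a single number to the driver
  message("rows after filtering: ", n)
  if (n > max_rows) {
    stop("refusing to collect ", n, " rows (limit ", max_rows, ")")
  }
  pull(tbl)             # only now are the rows moved into the R session
}
```

With the script from the question this might be called as `myDF <- collect_if_small(dftbl, max_rows = 5e6)`, where the limit is an arbitrary example value. If the count step already fails with an out-of-memory error, the load and filter stages are the likely culprit rather than collect itself.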
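cache() doesn't actually force any computation; it only marks the DataFrame as cached. All computation, along with the caching, happens once you perform an action on the DataFrame, such as count() or first().

About this: does your code apply any transformations on top of the DataFrame, such as map or reduce functions? Or are you using the DataFrame API? How are you running the application, and which resource manager are you using (Spark standalone or YARN)?

@Thiago Baldim There are no transformations other than those mentioned above. It is all written in sparklyr, which, as far as I understand, gets translated to Spark SQL.

@jay, as an investigative exercise, could you do the following: instead of collecting the data to the driver, first try writing it back to S3 and check the real size of the data (yes, it will differ in memory, but it will give you a basic feel).

Thanks for the information about what happens when collect is called. As for the suggestions, I have already implemented both, as described in my original question. As stated there, the sparklyr version of cache performs a count on the table. cache [i.e. count] runs fine; the OOM error only occurs when I call collect immediately afterwards. Can you explain that?

The title of this question looks like it is about Spark, but it is actually about a separate engine, sparklyr, which can do its own "cache/collect" computation. That is confusing.

It is not a separate engine, just a front end to Spark. But agreed, it is confusing that cache in a sparklyr call becomes a count in Spark.

Maybe "dataFrame.rdd.count" would also cause an OutOfMemory error, rather than "collect"?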
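Following the suggestion above to check the real size of the data, one rough way to see how large the result would be inside R is to measure a small collected sample and extrapolate. The sketch below uses plain R only; `estimate_collected_bytes` is a hypothetical helper, and the sample data frame and the row count of 50 million are made-up stand-ins, not figures from the question. The estimate is crude, since R stores each distinct string only once and real data may repeat values more or less than the sample does.

```r
# Sketch: extrapolate the R-side memory footprint from a small collected sample.
# `estimate_collected_bytes` is a hypothetical helper, not part of sparklyr.
estimate_collected_bytes <- function(sample_df, total_rows) {
  stopifnot(nrow(sample_df) > 0)
  bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
  bytes_per_row * total_rows
}

# Stand-in for a sample pulled from Spark, e.g. via head() followed by collect()
sample_df <- data.frame(
  user_id = sprintf("user_%06d", 1:1000),
  lookup  = sprintf("%d_%d_%d_%d", 1:1000, 1:1000, 1:1000, 1:1000),
  time    = as.numeric(1:1000),
  stringsAsFactors = FALSE
)

# 50e6 is a made-up total row count used only for illustration
est <- estimate_collected_bytes(sample_df, total_rows = 50e6)
cat(sprintf("estimated size in R: %.1f GB\n", est / 1024^3))
```

A table that looks small when cached in Spark can plausibly be much larger as an R data frame, because character columns in R are generally less compact than a compressed columnar cache.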

Code from the question:

#__________________________________________________________________________________________________________________________________

# Set parameters used for filtering rows
#__________________________________________________________________________________________________________________________________

firstDate <- '2017-07-01'
maxDate <- '2017-08-31'
advertiserID <- '4529611'
advertiserID2 <- '4601141'
advertiserID3 <- '4601141'

library(dplyr)
library(stringr)
library(sparklyr)

#__________________________________________________________________________________________________________________________________

# Configure & connect to spark
#__________________________________________________________________________________________________________________________________

Sys.setenv("SPARK_MEM"="100g")
Sys.setenv(HADOOP_HOME="C:/Users/Jay.Ruffell/AppData/Local/rstudio/spark/Cache/spark-2.0.1-bin-hadoop2.7/tmp/hadoop") 

config <- spark_config()
config$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3" # used to connect to S3
Sys.setenv(AWS_ACCESS_KEY_ID="")
Sys.setenv(AWS_SECRET_ACCESS_KEY="") # setting these blank ensures that AWS uses the IAM roles associated with the cluster to define S3 permissions

# Specify memory parameters - have tried lots of different values here!
config$`sparklyr.shell.driver-memory` <- '50g' 
config$`sparklyr.shell.executor-memory` <- '50g'
config$spark.driver.maxResultSize <- '50g'
sc <- spark_connect(master='local', config=config, version='2.0.1')

#__________________________________________________________________________________________________________________________________

# load data into spark from S3 ----
#__________________________________________________________________________________________________________________________________

#+++++++++++++++++++
# create spark table (not in memory yet) of all logfiles within logfiles path
#+++++++++++++++++++

spark_session(sc) %>%
  invoke("read") %>% 
  invoke("format", "orc") %>%
  invoke("load", 's3a://nz-omg-ann-aipl-data-lake/aip-connect-256537/orc-files/dcm-log-files/dt2-facts') %>% 
  invoke("createOrReplaceTempView", "alldatadf") 
alldftbl <- tbl(sc, 'alldatadf') # create a reference to the sparkdf without loading into memory

#+++++++++++++++++++
# define variables used to filter table down to daterange
#+++++++++++++++++++

# Calculate firstDate & maxDate as unix timestamps
unixTime_firstDate <- as.numeric(as.POSIXct(firstDate))+1
unixTime_maxDate <- as.numeric(as.POSIXct(maxDate)) + 3600*24-1

# Convert daterange params into date_year, date_month & date_day values to pass to filter statement
dateRange <- as.character(seq(as.Date(firstDate), as.Date(maxDate), by=1))
years <- unique(substring(dateRange, first=1, last=4))
if(length(years)==1) years <- c(years, years)
year_y1 <- years[1]; year_y2 <- years[2]
months_y1 <- substring(dateRange[grepl(years[1], dateRange)], first=6, last=7)
minMonth_y1 <- min(months_y1)
maxMonth_y1 <- max(months_y1)
months_y2 <- substring(dateRange[grepl(years[2], dateRange)], first=6, last=7)
minMonth_y2 <- min(months_y2)
maxMonth_y2 <- max(months_y2) 

# Repeat for 1 day prior to first date & one day after maxdate (because of the way logfile orc partitions are created, sometimes touchpoints can end up in the wrong folder by 1 day. So read in extra days, then filter by event time)
firstDateMinusOne <- as.Date(firstDate)-1
firstDateMinusOne_year <- substring(firstDateMinusOne, first=1, last=4)
firstDateMinusOne_month <- substring(firstDateMinusOne, first=6, last=7) 
firstDateMinusOne_day <- substring(firstDateMinusOne, first=9, last=10)
maxDatePlusOne <- as.Date(maxDate)+1
maxDatePlusOne_year <- substring(maxDatePlusOne, first=1, last=4)
maxDatePlusOne_month <- substring(maxDatePlusOne, first=6, last=7)
maxDatePlusOne_day <- substring(maxDatePlusOne, first=9, last=10)

#+++++++++++++++++++
# Read in data, filter & select
#+++++++++++++++++++

# startTime <- proc.time()[3]
dftbl <- alldftbl %>% # create a reference to the sparkdf without loading into memory

  # filter by month and year, using ORC partitions for extra speed
  filter(((date_year==year_y1  & date_month>=minMonth_y1 & date_month<=maxMonth_y1) |
            (date_year==year_y2 & date_month>=minMonth_y2 & date_month<=maxMonth_y2) |
            (date_year==firstDateMinusOne_year & date_month==firstDateMinusOne_month & date_day==firstDateMinusOne_day) |
            (date_year==maxDatePlusOne_year & date_month==maxDatePlusOne_month & date_day==maxDatePlusOne_day))) %>%

  # filter to be within firstdate & maxdate. Note that event_time_char will be in UTC, so 12hrs behind.
  filter(event_time>=(unixTime_firstDate*1000000) & event_time<(unixTime_maxDate*1000000)) %>%

  # filter by advertiser ID
  filter(((advertiser_id==advertiserID | advertiser_id==advertiserID2 | advertiser_id==advertiserID3) & 
            !is.na(advertiser_id)) |
           ((floodlight_configuration==advertiserID | floodlight_configuration==advertiserID2 | 
               floodlight_configuration==advertiserID3) & !is.na(floodlight_configuration)) & user_id!="0") %>%

  # Define cols to keep
  transmute(time=as.numeric(event_time/1000000),
            user_id=as.character(user_id),
            action_type=as.character(if(fact_type=='click') 'C' else if(fact_type=='impression') 'I' else if(fact_type=='activity') 'A' else NA),
            lookup=concat_ws("_", campaign_id, ad_id, site_id_dcm, placement_id),
            activity_lookup=as.character(activity_id),
            sv1=as.character(segment_value_1),
            other_data=as.character(other_data))  %>%
  mutate(time_char=as.character(from_unixtime(time)))

# cache to memory
dftbl <- sdf_register(dftbl, "filtereddf")
tbl_cache(sc, "filtereddf")

#__________________________________________________________________________________________________________________________________

# Collect out of spark
#__________________________________________________________________________________________________________________________________

myDF <- collect(dftbl)
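As a worked example of the event_time bounds used in the filter above: the script converts the two dates to Unix seconds and multiplies by 1000000, so it treats event_time as microseconds. The sketch below repeats that arithmetic with the time zone pinned to UTC, which is an assumption made here so the result does not depend on the machine; the original script calls as.POSIXct() without a tz argument and therefore uses the session's local time zone.

```r
# Worked example of the event_time bounds, pinned to UTC (an assumption; the
# original script relies on the session's local time zone instead).
firstDate <- '2017-07-01'
maxDate   <- '2017-08-31'

# Same arithmetic as the script: one second after the start of the first day,
# one second before the end of the last day
lower_s <- as.numeric(as.POSIXct(firstDate, tz = "UTC")) + 1
upper_s <- as.numeric(as.POSIXct(maxDate, tz = "UTC")) + 3600 * 24 - 1

# The filter compares event_time against these values scaled to microseconds
lower_us <- lower_s * 1e6
upper_us <- upper_s * 1e6

print(format(lower_us, scientific = FALSE))
print(format(upper_us, scientific = FALSE))
```

If the local time zone is far from UTC, the bounds shift by the same offset, which is worth keeping in mind given the script's own note that the event times are in UTC.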

The Spark documentation on RDD persistence, referenced in the answer above, says: "You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes."