Generator function for a Keras LSTM that outputs mini-batches from one file


I have a generator function that works fine. I have a large list of .txt files, where each file is quite long. The task now is to write a generator function that takes:

  • a batch of files, and
  • then a batch of size 128 from one file at a time.

My current code:

    data_files_generator <- function(train_set) {
    
      files <- train_set
      next_file <- 0
    
      function() {
    
        # move to the next file (note the <<- assignment operator)
        next_file <<- next_file + 1
    
        # if we've exhausted all of the files then start again at the
        # beginning of the list (keras generators need to yield
        # data infinitely -- termination is controlled by the epochs
        # and steps_per_epoch arguments to fit_generator())
        if (next_file > length(files))
        {next_file <<- 1}
    
        # determine the file name
        file <- files[[next_file]]
    
        text <- read_lines(paste(data_dir, file, sep = "" )) %>%
          str_to_lower() %>%
          str_c(collapse = "\n") %>%
          removeNumbers() %>%
          tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)
    
        text <- text[text %in% chars]
    
        dataset <- map(
          seq(1, length(text) - maxlen - 1, by = 3), 
          ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
        )
    
        dataset <- transpose(dataset)
    
        # Vectorization
        x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
        y <- array(0, dim = c(length(dataset$sentece), length(chars)))
    
        for(i in 1:length(dataset$sentece)){
    
          x[i,,] <- sapply(chars, function(x){
            as.integer(x == dataset$sentece[[i]])
          })
    
          y[i,] <- as.integer(chars == dataset$next_char[[i]])
    
        }
        rounded_dim <- floor(dim(x)[1]/mini_batch_size)
        match_size_to_batch <- 128 * rounded_dim
    
        x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
        y <- y_val[1:match_size_to_batch, 1:length(chars)]
    
        return(list(x, y))
    
      }
    }
    

I hope I have explained this clearly. I think I need to add some kind of for loop that iterates over the sample length, but I don't know how to include it in the generator function.
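
For context, a closure-built generator like the one above is normally handed to fit_generator(), as the comments in the code mention. The sketch below shows only that wiring; model, train_files and the step counts are placeholder assumptions, not values from the question:

    library(keras)

    # Sketch of the training call the code comments refer to. `model` and
    # `train_files` are placeholders; steps_per_epoch controls how many times
    # the (infinite) generator is called per epoch.
    train_gen <- data_files_generator(train_files)

    model %>% fit_generator(
      generator = train_gen,
      steps_per_epoch = length(train_files),  # e.g. one generator call per file
      epochs = 10
    )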

Based on the error, you are trying to feed in an object of shape (112512, 40, 43), but your LSTM layer expects an object of shape (128, 40, 43). It looks like some code is missing, but are you fixing the batch size when you define the input layer? I have had good luck defining my input layer as:

    l_input = Input(shape = (None, num_features), name = 'input_layer')
    
I suspect the error is caused by these lines:

    rounded_dim <- floor(dim(x)[1]/mini_batch_size)
    match_size_to_batch <- 128 * rounded_dim
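
To make the input-layer point concrete, here is a minimal sketch in the R keras interface of the difference between a flexible and a fixed batch dimension; the model itself is an illustrative assumption (reusing maxlen and chars from the question), not the asker's actual architecture:

    library(keras)

    # Flexible batch size: only the timestep and feature dimensions are fixed,
    # so arrays with any first dimension (128, 112512, ...) are accepted.
    input_flexible <- layer_input(shape = c(maxlen, length(chars)))

    # Fixed batch size: batch_shape pins the first dimension to 128, so feeding
    # an array whose first dimension differs raises a shape error like the one
    # in the question.
    input_fixed <- layer_input(batch_shape = c(128, maxlen, length(chars)))

    output <- input_flexible %>%
      layer_lstm(units = 128) %>%
      layer_dense(units = length(chars), activation = "softmax")

    model <- keras_model(inputs = input_flexible, outputs = output)

With the flexible definition, the size of each batch the generator yields does not by itself trigger the error; the (128, 40, 43) expectation suggests the batch dimension was fixed somewhere in the model definition.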
    


I have implemented a for loop and it now returns batches of size 128:

Changed code:

    data_files_generator <- function(train_set) {
    
      files <- train_set
      next_file <- 0
    
      function() {
    
        # move to the next file (note the <<- assignment operator)
        next_file <<- next_file + 1
    
        # if we've exhausted all of the files then start again at the
        # beginning of the list (keras generators need to yield
        # data infinitely -- termination is controlled by the epochs
        # and steps_per_epoch arguments to fit_generator())
        if (next_file > length(files))
        {next_file <<- 1}
    
        # determine the file name
        file <- files[[next_file]]
    
        text <- read_lines(paste(data_dir, file, sep = "" )) %>%
          str_to_lower() %>%
          str_c(collapse = "\n") %>%
          removeNumbers() %>%
          tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)
    
        text <- text[text %in% chars]
    
        dataset <- map(
          seq(1, length(text) - maxlen - 1, by = 3), 
          ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
        )
    
        dataset <- transpose(dataset)
    
        # Vectorization
        x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
        y <- array(0, dim = c(length(dataset$sentece), length(chars)))
    
        for(i in 1:length(dataset$sentece)){
    
          x[i,,] <- sapply(chars, function(x){
            as.integer(x == dataset$sentece[[i]])
          })
    
          y[i,] <- as.integer(chars == dataset$next_char[[i]])
    
        }
        rounded_dim <- floor(dim(x)[1]/mini_batch_size)
        match_size_to_batch <- 128 * rounded_dim
    
        x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
        y <- y_val[1:match_size_to_batch, 1:length(chars)]
    
        #Edit:
        span_start <-1
        for (iter in 1:rounded_dim){
         i <- iter * 128
         span_end <- iter * 128
         x <- x[span_start:span_end, 1:maxlen, 1:length(chars)]
         y <- y[span_start:span_end, 1:length(chars)]
         span_start <- i
         return(list(x, y))
        }
      }
    }
    


That's right, that is exactly my problem. I want to change the code in this way to get batches of size 128. I think I have managed to do it, but I am not sure whether it returns the whole text or only the last batch. I will edit the code in the question.

Are you still getting the error? I think you also need span_start + 1.

No error :) Why would I need the + 1? I guess it does not matter. Thx anyway.

Either way, your ranges will look like this: 1:128, then 128:256, then 256:384, and so on, so some samples will show up twice, in separate batches. That is probably not a big deal, but it is something to be aware of.
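
As a follow-up to that last comment: in the changed code, return(list(x, y)) inside the for loop exits the function on the first iteration, and span_start <- i makes consecutive spans share an index. Below is a hedged sketch of an alternative, assuming the same closure structure as the question and a hypothetical prepare_file() helper that stands in for the read/tokenize/vectorize steps shown above:

    # Sketch only: same closure idea as the question, but with a second piece of
    # state (`next_batch`) so that each call returns one non-overlapping
    # 128-sample slice. `prepare_file()` is a hypothetical helper wrapping the
    # read_lines/tokenize/vectorization steps from the code above.
    data_files_generator <- function(train_set, mini_batch_size = 128) {

      files <- train_set
      next_file <- 0
      next_batch <- 0
      n_batches <- 0
      x_file <- NULL
      y_file <- NULL

      function() {

        # move to the next file only once the current one is exhausted
        if (next_batch >= n_batches) {
          next_file <<- next_file + 1
          if (next_file > length(files)) next_file <<- 1

          vectorized <- prepare_file(files[[next_file]])  # hypothetical helper
          x_file <<- vectorized$x
          y_file <<- vectorized$y
          n_batches <<- floor(dim(x_file)[1] / mini_batch_size)
          next_batch <<- 0
        }

        next_batch <<- next_batch + 1

        # non-overlapping slices: 1:128, 129:256, 257:384, ...
        span_start <- (next_batch - 1) * mini_batch_size + 1
        span_end   <- next_batch * mini_batch_size

        list(
          x_file[span_start:span_end, , , drop = FALSE],
          y_file[span_start:span_end, , drop = FALSE]
        )
      }
    }

The generator contract for fit_generator() is just a function that returns list(x, y) each time it is called, so keeping the slicing position in the closure (the same trick the original code already uses for next_file) removes the need for an explicit for loop inside the function.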