R: Quickly generating partial sequences


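I want to generate text sequences with an RNN trained on text snippets (which I have already done).

One step is to extract text snippets and break them into subsequences for training the model. For example: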

c("E","X","A","M","P","L","E")
becomes

c("E")
c("E","X")
c("E","X","A")
...
My current approach is to use map over each word:

require(tidyverse)

data <- data_frame(id = c(1,2),word = list(c("E","X","A","M","P","L","E"), c("R","S","T","U","D","I","O")))

result <- data %>%
  pmap(function(id,word){
    subs <- map(1:length(word),function(i) word[1:i])
    data_frame(id = id, sub = subs)
  }) %>%
  bind_rows()

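It turns out the problem was calling data_frame inside the map function. Apparently, creating data frames is slow. If you drop data frames and stick with lists, it can be done faster: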

result <- data %>%
  pmap(function(id,word){
    map(1:length(word),function(i) list(id = id, sub = word[1:i]))
  }) %>%
  purrr::flatten()

I was hoping to convert it all into a single data_frame at the end using bind_rows(), but for some reason that function does not work with list columns.
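
One way around this, sketched here in base R rather than with bind_rows(), is to pull the ids and the subsequences out of the flat list separately and attach the subsequences as a list column. The `records` list below is a hypothetical stand-in for the flattened `result`, and the sketch assumes every element carries exactly one `id` and one `sub`.

```r
# Hypothetical stand-in for the flattened result: one record per subsequence
records <- list(
  list(id = 1, sub = c("E")),
  list(id = 1, sub = c("E", "X")),
  list(id = 2, sub = c("R"))
)

# Extract the scalar ids into an atomic vector
ids <- vapply(records, function(r) r$id, numeric(1))
# Keep the subsequences as a list, one element per row
subs <- lapply(records, function(r) r$sub)

# Build the data frame first, then attach the list as a list column
df <- data.frame(id = ids)
df$sub <- subs

str(df)
```

Assigning the list after the data frame is created keeps each subsequence intact in its own row instead of having it spread across columns.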

You are looking for Reduce with accumulate = TRUE:

Reduce(c,a,accumulate = T)
[[1]]
[1] "E"

[[2]]
[1] "E" "X"

[[3]]
[1] "E" "X" "A"

[[4]]
[1] "E" "X" "A" "M"

[[5]]
[1] "E" "X" "A" "M" "P"

[[6]]
[1] "E" "X" "A" "M" "P" "L"

[[7]]
[1] "E" "X" "A" "M" "P" "L" "E"
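
The call above does not define `a`; the sketch below assumes it is the example character vector from the question. It also checks that Reduce with accumulate = TRUE produces the same prefixes as the explicit indexing used in the question.

```r
# Assumption: `a` is the example word from the question
a <- c("E", "X", "A", "M", "P", "L", "E")

# Prefixes via Reduce: each step appends the next letter to the previous result
via_reduce <- Reduce(c, a, accumulate = TRUE)

# Prefixes via explicit indexing, as in the question's map() call
via_index <- lapply(seq_along(a), function(i) a[seq_len(i)])

# Compare the two lists of prefixes
identical(via_reduce, via_index)
```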
So, to apply this to your data, you can do the following:

data%>%
  group_by(id)%>%
  mutate(word=list(Reduce(c,unlist(word),accumulate = T)))%>%
  unnest()
To do the same in purrr, use the function accumulate:

purrr::accumulate(a,c)

Although this is a function in purrr, it is essentially just calling the Reduce function. That is:

purrr::accumulate
function (.x, .f, ..., .init) 
{
    .f <- as_mapper(.f, ...)
    f <- function(x, y) {
        .f(x, y, ...)
    }
    Reduce(f, .x, init = .init, accumulate = TRUE)#THIS IS USING THE BASE FUNCTION Reduce
}
<environment: namespace:purrr>
Using lapply and Reduce here may be faster:

x <- lapply(data$word, function(w){
    Reduce(c, w, accumulate = TRUE)})
x
# repeat each id once per subsequence generated for its word
id2 <- rep(data$id, unlist(lapply(x, length)))

data2 <- data_frame(id2, subs=unlist(x, recursive=FALSE))