R: create a new variable that takes into account prior information from earlier records
Tags: r, dplyr, data.table, tidyr
I have data like the following, and I want to create a new variable that takes into account the information from the previous period. For example:
moviewatched<- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama')
name<- c('john', 'john', 'john', 'john', 'john','kate','kate')
time<- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018')
df<- data.frame(moviewatched, name, time)
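A solution using dplyr. We can remove duplicated rows based on moviewatched and name, count the unique moviewatched values per name and time, and then use cumsum to calculate the running total within each name. df2 is the final output.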
library(dplyr)
df2 <- df %>%
distinct(moviewatched, name, .keep_all = TRUE) %>%
group_by(name, time) %>%
summarise(movietypewatched = n_distinct(moviewatched)) %>%
mutate(movietypewatched = cumsum(movietypewatched)) %>%
ungroup()
df2
# # A tibble: 4 x 3
# name time movietypewatched
# <fct> <fct> <int>
# 1 john 1-2018 2
# 2 john 2-2018 3
# 3 kate 1-2018 1
# 4 kate 2-2018 2
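First convert the time data to a date class to establish an ordering, e.g. using lubridate::myd with truncated = 1. From there, arrange the rows to make sure they are in order. Then, grouped by name, use purrr::accumulate to build a list of the unique values seen so far in moviewatched; calling lengths on that list returns the number of movie types seen up to each point. Aggregate by month with max to get the cumulative number of types for each month.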
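library(tidyverse)
df %>%
  mutate(time = lubridate::myd(time, truncated = 1)) %>%  # parse "m-YYYY" as a date
  group_by(name) %>%
  arrange(name, time) %>%
  # running set of unique genres per name, then its size at each row
  mutate(n_types = lengths(accumulate(moviewatched, ~unique(c(...))))) %>%
  group_by(name, time) %>%
  summarise(n_types = max(n_types))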
#> # A tibble: 4 x 3
#> # Groups:   name [2]
#>   name  time       n_types
#> 1 john  2018-01-01       2
#> 2 john  2018-02-01       3
#> 3 kate  2018-01-01       1
#> 4 kate  2018-02-01       2
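To illustrate why this answer converts time to a date class before ordering, here is a minimal base R sketch. The two month strings are made-up values in the question's "m-YYYY" format, not data from the question.

```r
# Hypothetical month strings in the question's "m-YYYY" format
time_chr <- c("2-2018", "10-2018")

# Compared as text, "10-2018" sorts first because "1" comes before "2"
min_as_text <- min(time_chr)

# Compared as dates (day fixed to 01), February correctly comes first
time_date <- as.Date(paste0("01-", time_chr), format = "%d-%m-%Y")
min_as_date <- min(time_date)

print(min_as_text)
print(format(min_as_date))
```

With only single-digit months, as in the question's sample, the text ordering happens to agree with the chronological one, so the problem only shows up on larger data.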
Using data.table:
library(data.table)
df <- unique(df)
setDT(df)[, movietypewatched := 1:.N, by = c("moviewatched", "name")]
df <- df[!(movietypewatched == 2), ]
df[, movietypewatched := .N, by = c("name", "time")][, moviewatched := NULL]
df <- unique(df)
df[, movietypewatched := cumsum(movietypewatched), by = name]
name time movietypewatched
1: john 1-2018 2
2: john 2-2018 3
3: kate 1-2018 1
4: kate 2-2018 2
Make a table of the date each movie type was first watched; count by month; and take the cumulative sum:
library(data.table)
setDT(df)
# fix bad date
df[, d := as.IDate(paste(time, "01", sep="-"), "%m-%Y-%d")]
# identify month first watched
fw = df[, .(d = min(d)), by=.(name, moviewatched)]
# count new movies per month
nm = fw[, .N, keyby=.(name, d)]
# take cumulative count
nm[, cN := cumsum(N), by=name]
name d N cN
1: john 2018-01-01 2 2
2: john 2018-02-01 1 3
3: kate 2018-01-01 1 1
4: kate 2018-02-01 1 2
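You need to convert the dates; otherwise min() will give incorrect results and/or fail. There are two aggregation steps here, but thanks to the optimizations in data.table the code should be fast (see ?GForce).
Here is an approach with intermediate steps, in case you want the unique values in genre as well as the count of genres.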
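Note:
- You need to arrange the data frame by name and time to accumulate the values.
- You can use lag() to get the previous value. Since the first entry for each name has no previous value, it gives NA.
- When counting unique genres with n_distinct(), you need to remove the NAs.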
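library(dplyr)
library(purrr)
library(tidyr)
# written against older tidyr/purrr; nest(.key = ) and lift() are deprecated in current releases
df_final <- df %>%
  arrange(name, time) %>%
  group_by(name, time) %>%
  nest(.key = 'genre') %>%
  group_by(name) %>%
  # combine each period's genres with the previous period's, then de-duplicate
  mutate(genre_all = map2(genre, lag(genre), rbind) %>% map(unique)) %>%
  ungroup() %>%
  mutate(genre_count = map_int(genre_all, ~ lift(n_distinct)(.x, na.rm = TRUE)))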
Result:
> df_final
# A tibble: 4 x 5
name time genre genre_all genre_count
<fct> <fct> <list> <list> <int>
1 john 1-2018 <tibble [3 x 1]> <tibble [3 x 1]> 2
2 john 2-2018 <tibble [2 x 1]> <tibble [3 x 1]> 3
3 kate 1-2018 <tibble [1 x 1]> <tibble [2 x 1]> 1
4 kate 2-2018 <tibble [1 x 1]> <tibble [2 x 1]> 2
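Because lag() only looks one period back, the approach above covers the two periods in the sample. As a hedged sketch of how the same idea could extend to any number of periods, the base R code below accumulates the set of genres with Reduce(). The helper name count_cumulative and the output column genre_count are illustrative choices, and the data is the question's sample stored as character vectors.

```r
# Question's sample data, kept as character vectors
df <- data.frame(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cumulative distinct genres per period for one person's rows
count_cumulative <- function(d) {
  d$date <- as.Date(paste0("01-", d$time), format = "%d-%m-%Y")
  d <- d[order(d$date), ]
  # unique genres per period, in chronological order
  per_period <- lapply(split(d$moviewatched, d$date), unique)
  # running union: genres seen up to and including each period
  seen <- Reduce(union, per_period, accumulate = TRUE)
  data.frame(
    name = d$name[1],
    date = as.Date(names(per_period)),
    genre_count = lengths(seen)
  )
}

res <- do.call(rbind, lapply(split(df, df$name), count_cumulative))
rownames(res) <- NULL
print(res)
```

On the sample data this gives the same counts as the result table above (2 and 3 for john, 1 and 2 for kate).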
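Comments:
Wow. I don't understand how distinct works across the whole grouping. Is it because of summarise? And what is the equivalent in data.table? I couldn't do that, although I knew it was the way to go. For example, df[, uniqueN(moviewatched), by = .(time, name)] won't work, because uniqueN is applied within each group.
@www Bravo. That gave me a data.table solution. I also didn't know about duplicated. I will delete my bad answer. Thanks again.
FYI, this syntax is supported: duplicated(DT, by = c("col1", "col2")), though in this case you should use unique(DT, by = c("col1", "col2")), I think.
@Frank Good to know about duplicated(DT, by = c("col1", "col2")). Thanks.
How can I get the actual number of new movie types watched? For example, 2, 1, 1, 1, showing only the new movie types he/she watched. Thanks a lot.
Maybe df %>% group_by(name) %>% arrange(name, time) %>% mutate(new = c(1, diff(lengths(accumulate(moviewatched, ~unique(c(...))))))) %>% group_by(name, time) %>% summarise(types = sum(new))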
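The duplicated() idea from the comments can also answer the follow-up about the number of new movie types per period. The following is a minimal base R sketch under the assumption that the question's sample data is stored as character vectors. The column names is_new, new and cumulative are illustrative.

```r
# Question's sample data, kept as character vectors
df <- data.frame(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018"),
  stringsAsFactors = FALSE
)

# Put rows in chronological order within each name before calling duplicated()
df$date <- as.Date(paste0("01-", df$time), format = "%d-%m-%Y")
df <- df[order(df$name, df$date), ]

# TRUE the first time a name/genre pair appears
df$is_new <- !duplicated(df[, c("name", "moviewatched")])

# New genres per name and period
new_types <- aggregate(is_new ~ name + date, data = df, FUN = sum)
new_types <- new_types[order(new_types$name, new_types$date), ]
names(new_types)[3] <- "new"

# Running total of new genres within each name
new_types$cumulative <- ave(new_types$new, new_types$name, FUN = cumsum)
rownames(new_types) <- NULL
print(new_types)
```

The new column holds the per-period counts asked about in the comment (2, 1, 1, 1 on the sample), and cumulative reproduces the running totals from the answers.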