Reshaping in R when columns don't represent all time points
My dataset looks a bit like this:
ID job01 age_started_job01 job02 age_started_job02 job03 age_started_job03
1 "waiter" 18 "lawyer" 25 NA NA
2 "plumber" 18 "builder" 20 "foreman" 25
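To run the answers, the table above can be rebuilt as a data frame along these lines. This is only a sketch: the name df matches what the answers use, but the column types (character jobs, numeric ages) are an assumption, since the question shows only printed values.

```r
# Sample data copied from the table in the question.
# Assumption: jobs are character strings and ages are numeric.
df <- data.frame(
  ID = 1:2,
  job01 = c("waiter", "plumber"),
  age_started_job01 = c(18, 18),
  job02 = c("lawyer", "builder"),
  age_started_job02 = c(25, 20),
  job03 = c(NA, "foreman"),
  age_started_job03 = c(NA, 25),
  stringsAsFactors = FALSE
)
str(df)
```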
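I am trying to get from this to an intermediate format (note that I assume that people who did not start a new job in a given year stayed in the same job as the previous year), and then convert it to long format, like this: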
ID job age
1 "waiter" 18
1 "waiter" 19
1 "waiter" 20
1 "waiter" 21
1 "waiter" 22
1 "waiter" 23
1 "waiter" 24
1 "lawyer" 25
2 "plumber" 18
2 "plumber" 19
2 "builder" 20
2 "builder" 21
2 "builder" 22
2 "builder" 23
2 "builder" 24
2 "foreman" 25
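The step from wide to long can be done with reshape2, and I know how to do that, but I can't get from the first dataset to the intermediate format. I tried something like this (ugly, and with a loop):
# Create job variables
for (age in 18:25) {
  data[, paste("job_age", age, sep = "")] <- ...

You could try: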
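# Stack the job columns (2, 4, 6) and the age columns (3, 5, 7) into long format
df1 <- reshape(df, idvar='ID', timevar='time',
               varying=list(c(2,4,6), c(3,5,7)), direction='long')
colnames(df1)[4] <- 'age'
# Every age that should appear in the result
d1 <- data.frame(age=18:25)
res <- do.call(rbind, lapply(split(df1, df1$ID), function(x) {
    # Add rows for the ages at which no new job started
    x1 <- merge(x, d1, by='age', all=TRUE)
    # Carry the most recent job forward over the added rows
    x1$job <- unique(na.omit(x1$job01))[cumsum(!is.na(x1$job01))]
    x1$ID <- x1$ID[1]
    # Keep ID, job, age and drop rows without an age
    na.omit(x1[,c(2,5,1)])}))
row.names(res) <- NULL
res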
# ID job age
#1 1 waiter 18
#2 1 waiter 19
#3 1 waiter 20
#4 1 waiter 21
#5 1 waiter 22
#6 1 waiter 23
#7 1 waiter 24
#8 1 lawyer 25
#9 2 plumber 18
#10 2 plumber 19
#11 2 builder 20
#12 2 builder 21
#13 2 builder 22
#14 2 builder 23
#15 2 builder 24
#16 2 foreman 25
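The carry-forward assumption can also be written without reshape and merge. The following is a minimal base R sketch, not the method used above: for every age from 18 to 25 it looks up the most recent job start with findInterval. The sample data frame is rebuilt here with assumed column types, and the names job_cols, age_cols and ages are introduced only for illustration.

```r
# Sample data from the question (column types are an assumption).
df <- data.frame(
  ID = 1:2,
  job01 = c("waiter", "plumber"), age_started_job01 = c(18, 18),
  job02 = c("lawyer", "builder"), age_started_job02 = c(25, 20),
  job03 = c(NA, "foreman"),       age_started_job03 = c(NA, 25),
  stringsAsFactors = FALSE
)

job_cols <- c("job01", "job02", "job03")
age_cols <- paste0("age_started_", job_cols)
ages <- 18:25

res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  starts <- unlist(df[i, age_cols])
  jobs <- unlist(df[i, job_cols])
  # Drop the unused job slots and sort the rest by start age
  keep <- !is.na(starts)
  ord <- order(starts[keep])
  starts <- starts[keep][ord]
  jobs <- jobs[keep][ord]
  # Index of the latest job started at or before each age (0 = none yet)
  idx <- findInterval(ages, starts)
  has_job <- idx > 0
  data.frame(ID = df$ID[i],
             job = unname(jobs[idx[has_job]]),
             age = ages[has_job],
             stringsAsFactors = FALSE)
}))
res
```

Ages before a subject's first job are simply left out, which sidesteps the case raised in the comments where the first job starts after 18.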
Here is another approach, using merged.stack + expandRows from my "splitstackshape" package. This answer uses @akrun's sample data:
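library(splitstackshape)
# Stack the "job" and "age_started_job" columns, dropping the empty job slots
DT <- na.omit(merged.stack(df, var.stubs = c("job", "age_started_job"),
                           sep = "var.stubs"))
DT[, age_started_job := as.numeric(age_started_job)]
# Number of years spent in each job; the last job gets one year
DT[, Range := diff(c(age_started_job, (age_started_job[.N]+1))), by = list(ID)]
# Repeat each row "Range" times, then count the age up within each job
expandRows(DT, "Range")[, age_started_job := age_started_job +
                          (sequence(.N)-1), by = list(ID, .time_1)][]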
# ID .time_1 job age_started_job
# 1: 1 01 waiter 18
# 2: 1 01 waiter 19
# 3: 1 01 waiter 20
# 4: 1 01 waiter 21
# 5: 1 01 waiter 22
# 6: 1 01 waiter 23
# 7: 1 01 waiter 24
# 8: 1 02 lawyer 25
# 9: 2 01 plumber 18
# 10: 2 01 plumber 19
# 11: 2 02 builder 20
# 12: 2 02 builder 21
# 13: 2 02 builder 22
# 14: 2 02 builder 23
# 15: 2 02 builder 24
# 16: 2 03 foreman 25
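To see what the Range and sequence(.N) steps do, here is a small base R sketch on the start ages of ID 2 alone. It uses rep() in place of expandRows(), so it only illustrates the arithmetic and is not the package's implementation; the vector names are made up for the example.

```r
# Start ages and jobs for ID 2, taken from the question's table.
starts <- c(18, 20, 25)
jobs <- c("plumber", "builder", "foreman")

# Years spent in each job: gaps between consecutive starts,
# with the last job counted as a single year.
range <- diff(c(starts, starts[length(starts)] + 1))
range
# [1] 2 5 1

# Repeat each job "range" times, then count the age up within each job.
job_by_year <- rep(jobs, times = range)
age_by_year <- rep(starts, times = range) + sequence(range) - 1
data.frame(job = job_by_year, age = age_by_year)
```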
Your expected result shows ages 19, 21, etc., which are not found in the input dataset, as those are to be filled in.
The original dataset lists all jobs (including unemployment as a job type), so I can safely assume that a subject stayed in the same job until the next one started. So, for example, the entry for age 19 is the same as the one observed at 18, unless the subject happened to start a new job at 19.
It would have been easier if all ages were in the list. Now we have to do some merging, as I showed in my solution. Also, are you only looking for how to get the intermediate dataset, or the final one?
Yes. If I can get either dataset, the rest is just melting and casting with reshape2 in the usual way.
+1. Nice option. I also tried using the "splitstackshape" package :-)
+1 for using dplyr. Also, the variable indx came in handy later, when I found that the real data contains people whose first job started after age 18. mutate(job=replace(job,is.na(job),na.omit(job))) returned "replacement has length 0", but mutate(job=replace(job,is.na(job)&indx!=0,na.omit(job))) solved it.
@Guilherme Kenji Chihaya Thanks for the comments. I will update the function.
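The situation in the last comments, where a subject has no job yet at the first ages, can be sketched in base R as follows. The vector job and the index indx are hypothetical stand-ins (the dplyr answer that defines indx is not shown here), and defining indx as a running count of jobs started is an assumption.

```r
# Hypothetical subject whose first job starts at age 20 (ages 18 to 22).
job <- c(NA, NA, "waiter", NA, "lawyer")

# Assumption: indx counts the jobs started so far, so 0 means "no job yet".
indx <- cumsum(!is.na(job))

# Fill only the positions that have a previous job to carry forward.
filled <- rep(NA_character_, length(job))
filled[indx != 0] <- unique(na.omit(job))[indx[indx != 0]]
filled
# [1] NA       NA       "waiter" "waiter" "lawyer"
```

Restricting the fill to indx != 0 mirrors the condition added in the comment above: ages before the first job keep their NA instead of breaking the replacement.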