R data.table: filtering out data entry errors by selectively ignoring rows
In a project with millions of rows, we expect one row per year for each case under study. We found a data entry error that gave some cases extra rows in which some of the variables differ (so unique or duplicated alone cannot fix them). After checking many of these by hand, we understand the problems and what the result should look like. But I need your help getting data.table to do the right thing.
I made a small test case. The code to reproduce the data is further below.
> DT
year case status
1: 1980 a born
2: 1980 a alive
3: 1981 a alive
4: 1982 a alive
5: 1999 b alive
6: 1999 b alive
7: 2000 b alive
8: 2004 c alive
9: 2005 c alive
10: 1977 d alive
11: 1977 d dead
12: 1983 e alive
13: 1984 e born
14: 1984 e alive
15: 1985 e alive
16: 1986 e alive
17: 2000 f alive
18: 2001 f alive
19: 2002 f alive
20: 2002 f dead
21: 2003 f alive
year case status
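Problems to fix (all visible in the listing above): case a has both born and alive in 1980, case b has a duplicated 1999 row, case e has a born row in 1984 after being alive in 1983, and case f has a dead row in 2002 although it is alive in 2003. Case d, with alive and dead in 1977, is correct. Code to reproduce the data: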
library(data.table)
DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
2004, 2005, 1977, 1977, 1983, 1984, 1984,
1985, 1986, 2000, 2001, 2002, 2002, 2003),
case = c("a", "a", "a", "a", "b", "b", "b", "c",
"c", "d", "d", "e", "e", "e", "e", "e",
"f", "f", "f", "f", "f"),
status = c("born", "alive", "alive", "alive", "alive",
"alive", "alive", "alive", "alive", "alive",
"dead", "alive", "born", "alive", "alive", "alive",
"alive", "alive", "alive", "dead", "alive"))
## re-order the rows, just in case
DT <- DT[order(case, year, status)]
## A correct answer would be:
DT[-c(1, 5, 14, 20)]
## Here is my effort to fix problem 1.
setkey(DT, case, year)
## create a bunch of index variables, naive way to
## find first year with multiple rows
DT[ , idx:=1:.N, by = list(case, year)]
## number of rows with case,year
DT[ , count := uniqueN(status), by = list(case, year)]
DT[ , caseyr := 1:.N, by = list(case)]
DT[ , casegrp := .GRP, by = list(case, year)]
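Before filtering anything, it can help to list the case-years that have more than one row. The following is only a minimal sketch on the test data (rebuilt here so the block runs on its own); the names problem_years, n_rows and statuses are illustrative and not part of the original code.

```r
library(data.table)

DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
                          2004, 2005, 1977, 1977, 1983, 1984, 1984,
                          1985, 1986, 2000, 2001, 2002, 2002, 2003),
                 case = c("a", "a", "a", "a", "b", "b", "b", "c",
                          "c", "d", "d", "e", "e", "e", "e", "e",
                          "f", "f", "f", "f", "f"),
                 status = c("born", "alive", "alive", "alive", "alive",
                            "alive", "alive", "alive", "alive", "alive",
                            "dead", "alive", "born", "alive", "alive", "alive",
                            "alive", "alive", "alive", "dead", "alive"))

## Count rows per case and year, and list the statuses seen in that year.
## Any case-year with more than one row is a candidate for inspection.
problem_years <- DT[, .(n_rows = .N,
                        statuses = paste(sort(status), collapse = "/")),
                    by = .(case, year)][n_rows > 1]
problem_years
```

On the test data this flags five case-years (a 1980, b 1999, d 1977, e 1984, f 2002). Note that d 1977 is flagged too even though it is valid, so the count alone is not enough to decide which row to drop.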
Here is one solution:
library(data.table)
DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
2004, 2005, 1977, 1977, 1983, 1984, 1984,
1985, 1986, 2000, 2001, 2002, 2002, 2003),
case = c("a", "a", "a", "a", "b", "b", "b", "c",
"c", "d", "d", "e", "e", "e", "e", "e",
"f", "f", "f", "f", "f"),
status = c("born", "alive", "alive", "alive", "alive",
"alive", "alive", "alive", "alive", "alive",
"dead", "alive", "born", "alive", "alive", "alive",
"alive", "alive", "alive", "dead", "alive"))
#to check the ids removed are 1,5,14,20
setkey(DT,case, year, status)
DT[, id := 1:.N]
DT <- DT[order(case, year, status, -id)]
#remove duplicated alive (or other)
DT <- DT[!duplicated(DT[, list(case, year, status)])]
#compute year ordering
DT[, status_rank := rank(year), by = list(case)]
#remove late born
DT[, is_first := status_rank == min(status_rank), by = case]
DT <- DT[status != "born" | is_first]
#remove early dead
DT[, is_last := status_rank == max(status_rank), by = case]
DT <- DT[status != "dead" | is_last]
#remove redundant alive unless it's with dead
DT[, keep_alive := paste(sort(unique(status)), collapse = "") %in% c("alive", "alivedead") , by = list(case, year)]
DT <- DT[status != "alive" | keep_alive]
DT[, c("status_rank", "is_first", "is_last", "keep_alive") := NULL]
DT
year case status id
1: 1980 a born 2
2: 1981 a alive 3
3: 1982 a alive 4
4: 1999 b alive 6
5: 2000 b alive 7
6: 2004 c alive 8
7: 2005 c alive 9
8: 1977 d alive 10
9: 1977 d dead 11
10: 1983 e alive 12
11: 1984 e alive 13
12: 1985 e alive 15
13: 1986 e alive 16
14: 2000 f alive 17
15: 2001 f alive 18
16: 2002 f alive 19
17: 2003 f alive 21
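The step that carries this solution is rank(year) within each case. With the default tie handling, rows that share a year also share the same (averaged) rank, so both rows of a doubled first or last year are marked as first or last. A small base R illustration using the years of case a (the variable names here are only for this example):

```r
## Years for case "a": 1980 appears twice
years_a <- c(1980, 1980, 1981, 1982)

## Default ties.method = "average": the tied years share rank 1.5
r <- rank(years_a)
r                      # 1.5 1.5 3.0 4.0

## Both 1980 rows count as the first year, so a "born" row there is kept
is_first <- r == min(r)
is_first               # TRUE TRUE FALSE FALSE
```

This is why the born row of case a survives the "remove late born" step while the born row of case e in 1984 does not.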
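Here is another solution. Basically, we focus on the problem years (those where the count of rows per case and year is > 1) and filter according to your description.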
## re-order the rows, just in case
DT <- DT[order(case, year, status)]
DT <- unique(DT) #fix case b 1999
#create indicator more than one data per year
DT[,count_born_alive:=sum(status=="born",status=="alive"),by=.(case,year)]
DT[,count_alive_dead:=sum(status=="alive",status=="dead"),by=.(case,year)]
#cumsum alive
DT[,alive_sum:=cumsum(status=="alive"),by=case]
#filter problem rows
DT <- DT[DT[, .I[!(count_born_alive > 1 & status == "alive" & alive_sum == 1)], by = case]$V1] #Case "a" has 2 rows for 1980: drop the "alive" one
DT <- DT[DT[, .I[!(count_born_alive > 1 & status == "born" & alive_sum > 1)], by = case]$V1] #Case "e" has an erroneous "born" in 1984, after being "alive" in 1983
DT <- DT[DT[, .I[!(count_alive_dead > 1 & status == "dead" & alive_sum < max(alive_sum))], by = case]$V1] #Case "f" has an erroneous "dead" in 2002, because it shows as "alive" in 2003
DT[,.(year,case,status)]
year case status
1: 1980 a born
2: 1981 a alive
3: 1982 a alive
4: 1999 b alive
5: 2000 b alive
6: 2004 c alive
7: 2005 c alive
8: 1977 d alive
9: 1977 d dead
10: 1983 e alive
11: 1984 e alive
12: 1985 e alive
13: 1986 e alive
14: 2000 f alive
15: 2001 f alive
16: 2002 f alive
17: 2003 f alive
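To check a cleaning approach against the expected answer from the question, the result can be compared with DT[-c(1, 5, 14, 20)]. The sketch below does that for a condensed, single-filter variant of the rules. It is not either answerer's code, and it assumes the only errors are the four kinds seen in the test data; the helper columns first_year, last_year and has_born are names introduced here for illustration.

```r
library(data.table)

DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
                          2004, 2005, 1977, 1977, 1983, 1984, 1984,
                          1985, 1986, 2000, 2001, 2002, 2002, 2003),
                 case = c("a", "a", "a", "a", "b", "b", "b", "c",
                          "c", "d", "d", "e", "e", "e", "e", "e",
                          "f", "f", "f", "f", "f"),
                 status = c("born", "alive", "alive", "alive", "alive",
                            "alive", "alive", "alive", "alive", "alive",
                            "dead", "alive", "born", "alive", "alive", "alive",
                            "alive", "alive", "alive", "dead", "alive"))
DT <- DT[order(case, year, status)]
expected <- DT[-c(1, 5, 14, 20)]   # the answer given in the question

clean <- unique(DT)                # exact duplicates (case b, 1999)
clean[, `:=`(first_year = min(year), last_year = max(year)), by = case]
clean[, has_born := any(status == "born"), by = .(case, year)]

## Assumed rules: "born" only in the first year, "dead" only in the last year,
## and no "alive" next to a "born" in the first year
clean <- clean[!(status == "born" & year > first_year) &
               !(status == "dead" & year < last_year) &
               !(status == "alive" & has_born & year == first_year),
               .(year, case, status)]

## TRUE when the cleaned table matches the expected answer column by column
same <- identical(clean$year, expected$year) &&
  identical(clean$case, expected$case) &&
  identical(clean$status, expected$status)
same
```

The same comparison can be run on the output of either solution in this thread after dropping their helper columns.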
Thank you very much. rank is a magic bullet. This answer is easier for me to understand.
Thank you very much. I think the other answer is easier to understand/predict, but this one does work.