R data.table: filter data entry errors by selectively ignoring rows


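In a project with millions of rows of information, where we expect one row per year for each case studied, we found a data entry error: some cases have erroneous extra rows in which some of the variables differ. (That is, `unique` or `duplicated` cannot fix them.) After manually checking many of these, we understand the problems and what the result should look like. But I need your help to make data.table do the right thing.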

I made a small test case. The code to reproduce the data is below.

    > DT
        year case status
     1: 1980    a   born
     2: 1980    a  alive
     3: 1981    a  alive
     4: 1982    a  alive
     5: 1999    b  alive
     6: 1999    b  alive
     7: 2000    b  alive
     8: 2004    c  alive
     9: 2005    c  alive
    10: 1977    d  alive
    11: 1977    d   dead
    12: 1983    e  alive
    13: 1984    e   born
    14: 1984    e  alive
    15: 1985    e  alive
    16: 1986    e  alive
    17: 2000    f  alive
    18: 2001    f  alive
    19: 2002    f  alive
    20: 2002    f   dead
    21: 2003    f  alive
        year case status

Issues to be solved:

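  • Case "a" has 2 rows for 1980. Because that is the first record for that case, we want to keep the "born" row and remove the "alive" one.

  • Case "b" has 2 rows for 1999, but neither of them is "born". We want to keep just one "alive" row.

  • Case "e" has an erroneous "born" in 1984. Because it was "alive" in 1983, the 1984 "born" should be removed.

  • Case "f" has an erroneous "dead" in 2002. Because it shows as "alive" in 2003, we believe the "dead" is an error. So remove the row that says "dead", but only because it is followed by a later "alive".

  • Case "d" has 2 rows for 1977, but we want to keep both.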

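I have been looking for a good way to isolate the row groups. For each case, I want to flag the rows of the first year and then decide what to do with them. Naming the row groups with `.GRP` seems to add some clarity, but I still have the problem of isolating the first row group for each case.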
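
One possible way to isolate those row groups, as a minimal sketch rather than a full fix: count the rows per (case, year) and flag each case's earliest year. The helper columns `n_rows` and `first_year` and the table `problems` are names made up for this illustration; the data is the test case printed above.

```r
library(data.table)

# Same test data as in the question.
DT <- data.table(
  year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000, 2004, 2005, 1977, 1977,
           1983, 1984, 1984, 1985, 1986, 2000, 2001, 2002, 2002, 2003),
  case = rep(c("a", "b", "c", "d", "e", "f"), times = c(4, 3, 2, 2, 5, 5)),
  status = c("born", "alive", "alive", "alive", "alive", "alive", "alive",
             "alive", "alive", "alive", "dead", "alive", "born", "alive",
             "alive", "alive", "alive", "alive", "alive", "dead", "alive")
)

# n_rows and first_year are illustrative helper columns, not part of the data.
DT[, n_rows := .N, by = .(case, year)]            # rows sharing a case-year
DT[, first_year := year == min(year), by = case]  # TRUE in each case's first year

# Case-years that need a decision, and whether they are the case's first year.
problems <- unique(DT[n_rows > 1, .(case, year, first_year)])
print(problems)
```

On this data the flagged groups split into first-year groups (cases "a", "b", "d") and later groups (cases "e", "f"), which matches the distinction the rules in the list above depend on.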

    library(data.table)
    DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
                              2004, 2005, 1977, 1977, 1983, 1984, 1984,
                              1985, 1986, 2000, 2001, 2002, 2002, 2003),
                     case = c("a", "a",  "a", "a",   "b",  "b", "b", "c",
                               "c", "d", "d", "e",  "e", "e", "e", "e",
                               "f", "f", "f", "f", "f"),
                     status = c("born", "alive", "alive", "alive", "alive",
                                "alive", "alive", "alive", "alive", "alive",
                                "dead", "alive", "born", "alive", "alive", "alive",
                                "alive", "alive", "alive", "dead", "alive"))
    
    ## re-order the rows, just in case
    DT <- DT[order(case, year, status)]
    ## A correct answer would be:
    DT[-c(1, 5, 14, 20)]
    
    ## Here is my effort to fix problem 1.
    setkey(DT, case, year)
    
    ## create a bunch of index variables, naive way to
    ## find first year with multiple rows
    DT[ , idx:=1:.N, by = list(case, year)]
    ## number of rows with case,year 
    DT[ , count := uniqueN(status), by = list(case, year)]
    DT[ , caseyr := 1:.N, by = list(case)]
    DT[ , casegrp := .GRP, by = list(case, year)]
    
Here is a solution:

    library(data.table)
    DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
                              2004, 2005, 1977, 1977, 1983, 1984, 1984,
                              1985, 1986, 2000, 2001, 2002, 2002, 2003),
                     case = c("a", "a",  "a", "a",   "b",  "b", "b", "c",
                              "c", "d", "d", "e",  "e", "e", "e", "e",
                              "f", "f", "f", "f", "f"),
                     status = c("born", "alive", "alive", "alive", "alive",
                                "alive", "alive", "alive", "alive", "alive",
                                "dead", "alive", "born", "alive", "alive", "alive",
                                "alive", "alive", "alive", "dead", "alive"))
    #to check the ids removed are 1,5,14,20
    setkey(DT,case, year, status)
    DT[, id := 1:.N]
    DT <- DT[order(case, year, status, -id)]
    
    #remove duplicated alive (or other)
    DT <- DT[!duplicated(DT[, list(case, year, status)])]
    
    #compute year ordering
    DT[, status_rank := rank(year), by = list(case)]
    
    #remove late born
    DT[, is_first := status_rank == min(status_rank), by = case]
    DT <- DT[status != "born" | is_first]
    
    #remove early dead
    DT[, is_last := status_rank == max(status_rank), by = case]
    DT <- DT[status != "dead" | is_last]
    
    #remove redundant alive unless it's with dead
    DT[, keep_alive := paste(sort(unique(status)), collapse = "") %in% c("alive", "alivedead") , by = list(case, year)]
    DT <- DT[status != "alive" | keep_alive]
    DT[, c("status_rank", "is_first", "is_last", "keep_alive") := NULL]
    DT
        year case status id
     1: 1980    a   born  2
     2: 1981    a  alive  3
     3: 1982    a  alive  4
     4: 1999    b  alive  6
     5: 2000    b  alive  7
     6: 2004    c  alive  8
     7: 2005    c  alive  9
     8: 1977    d  alive 10
     9: 1977    d   dead 11
    10: 1983    e  alive 12
    11: 1984    e  alive 13
    12: 1985    e  alive 15
    13: 1986    e  alive 16
    14: 2000    f  alive 17
    15: 2001    f  alive 18
    16: 2002    f  alive 19
    17: 2003    f  alive 21
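
A small sketch of why `rank(year)` does the job in the code above: base R's `rank()` defaults to `ties.method = "average"`, so all rows that share a case's first (or last) year get the same rank and are flagged together. The vector below is simply the years of case "a" from the test data; the variable names are illustrative.

```r
# Years for case "a": 1980 appears twice.
years <- c(1980, 1980, 1981, 1982)

# Default ties.method = "average": the two 1980 rows share rank 1.5.
r <- rank(years)
print(r)

# Both 1980 rows count as "first", so a "born" in either of them passes
# the filter status != "born" | is_first.
is_first <- r == min(r)
print(is_first)
```

For this purpose, comparing `year == min(year)` within each case should give the same flags; `rank` just makes the ordering explicit.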
    
Here is another solution. Basically, we focus on the problem years (more than one row per case and year) and filter according to your specification.

    ## re-order the rows, just in case
    DT <- DT[order(case, year, status)]
    DT <- unique(DT) #fix case b 1999
    
    #per case-year, count the rows that are born/alive and the rows that are alive/dead
    DT[,count_born_alive:=sum(status=="born",status=="alive"),by=.(case,year)]
    DT[,count_alive_dead:=sum(status=="alive",status=="dead"),by=.(case,year)]
    
    #cumsum alive
    DT[,alive_sum:=cumsum(status=="alive"),by=case]
    
    #filter problem rows
    #case "a": 1980 has both "born" and "alive"; drop the "alive"
    DT <- DT[DT[, .I[!(count_born_alive > 1 & status == "alive" & alive_sum == 1)], by = case]$V1]
    #case "e": erroneous "born" in 1984, after it was already "alive" in 1983
    DT <- DT[DT[, .I[!(count_born_alive > 1 & status == "born" & alive_sum > 1)], by = case]$V1]
    #case "f": erroneous "dead" in 2002, because it shows as "alive" in 2003
    DT <- DT[DT[, .I[!(count_alive_dead > 1 & status == "dead" & alive_sum < max(alive_sum))], by = case]$V1]
    
    DT[,.(year,case,status)]
        year case status
     1: 1980    a   born
     2: 1981    a  alive
     3: 1982    a  alive
     4: 1999    b  alive
     5: 2000    b  alive
     6: 2004    c  alive
     7: 2005    c  alive
     8: 1977    d  alive
     9: 1977    d   dead
    10: 1983    e  alive
    11: 1984    e  alive
    12: 1985    e  alive
    13: 1986    e  alive
    14: 2000    f  alive
    15: 2001    f  alive
    16: 2002    f  alive
    17: 2003    f  alive
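
To see what `alive_sum` contributes in the code above, here is a sketch on case "e" alone, sorted by year and then status as in the answer. The table `e` and the result `late_born` are illustrative names. The idea, assuming duplicates have already been removed, is that a "born" row whose running count of "alive" rows is above 1 must have an "alive" row from an earlier year before it.

```r
library(data.table)

# Case "e" only, ordered by year then status ("alive" sorts before "born").
e <- data.table(year   = c(1983, 1984, 1984, 1985, 1986),
                status = c("alive", "alive", "born", "alive", "alive"))

# Running count of "alive" rows seen so far within the case.
e[, alive_sum := cumsum(status == "alive")]

# The 1984 "born" has alive_sum == 2: one "alive" from the same year plus
# one from 1983, so it is flagged as a late "born".
late_born <- e[status == "born" & alive_sum > 1]
print(late_born)
```

By the same logic, the legitimate "born" of case "a" has `alive_sum == 1` (only the same-year "alive" precedes it), so it is not flagged.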
    
Thanks very much. `rank` is a magic bullet. This answer is easier for me to understand.

Thanks very much. I think the other answer is easier to understand/predict, but this does work.