R: How to replace NA values with the previous non-NA value given to the same ID (tags: r, data.table)


I am working in R and using data.table. I have a dataset that looks like this:

ID   country_id    weight
1    BGD           56
1    NA            57
1    NA            63
2    SA            12
2    NA            53
2    SA            54

If the value in country_id is NA, I need to replace it with the non-NA country_id value given to the same ID. I would like the dataset to look like this:

ID   country_id    weight
1    BGD           56
1    BGD           57
1    BGD           63
2    SA            12
2    SA            53
2    SA            54

The dataset contains millions of IDs, so fixing each ID by hand is not an option.

Thanks for your help.

Edit: solved.

I used the following code:

dt[, country_id := country_id[!is.na(country_id)][1], by = ID]
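As a quick check, the one-liner above can be run against the sample data from the question. This is only a minimal sketch: the table name dt and the integer column types are assumptions, and it relies on each ID having at least one non-NA country_id (an ID with only NAs would stay NA).

```r
library(data.table)

# Sample data from the question (the name `dt` is an assumption)
dt <- data.table(
  ID         = c(1L, 1L, 1L, 2L, 2L, 2L),
  country_id = c("BGD", NA, NA, "SA", NA, "SA"),
  weight     = c(56L, 57L, 63L, 12L, 53L, 54L)
)

# Within each ID, overwrite country_id with the first non-NA value of that group
dt[, country_id := country_id[!is.na(country_id)][1], by = ID]

print(dt)
```

Note that this takes the first non-NA value in each group rather than the most recent preceding one, which is equivalent here because each ID maps to a single country.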

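Based on the answers and suggestions in the comments, you have several options. I simulated a dataset with 1,000,000 rows, with 30% of the values in your country_id column missing, to see what would work best in your case.

In this benchmark, the answer that scaled best replaces each NA with the first non-missing value for the same ID: dt[, country_id := country_id[!is.na(country_id)][1], by = ID]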


Another option is to use a join:

DT[is.na(country_id), country_id := 
    DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]]

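Explanation:

DT[is.na(country_id)] subsets the dataset to the rows that have NA in the country_id column.

.SD is the subset of data from the previous step, and is itself a data.table.

DT[!is.na(country_id)][.SD, on=.(ID)] left joins .SD with DT[!is.na(country_id)], using ID as the key.

j=country_id returns the country_id column from the right table DT[!is.na(country_id)], and mult="first" returns the first match if there are multiple matches.

country_id := updates the column country_id, in the rows of DT where is.na(country_id) is TRUE, with the result of the join.

Timing code, with data similar to Andrew's but larger: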
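The steps above can be unrolled into separate statements on the sample data from the question. This is a sketch for illustration only; the intermediate names missing_rows, known and fill_values are hypothetical and do not appear in the original answer.

```r
library(data.table)

DT <- data.table(
  ID         = c(1L, 1L, 1L, 2L, 2L, 2L),
  country_id = c("BGD", NA, NA, "SA", NA, "SA"),
  weight     = c(56L, 57L, 63L, 12L, 53L, 54L)
)

# Step 1: rows that need filling (this is what .SD holds inside the update)
missing_rows <- DT[is.na(country_id)]

# Step 2: lookup table made of the rows with a known country_id
known <- DT[!is.na(country_id)]

# Step 3: for each missing row, take country_id from the first matching row of `known`
fill_values <- known[missing_rows, on = .(ID), mult = "first", country_id]

# Step 4: write the looked-up values back into the NA rows only
DT[is.na(country_id), country_id := fill_values]

print(DT)
```

Unlike the grouped one-liner, this only touches the rows where country_id is NA, so most of the work is a single join instead of one evaluation per ID.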

library(data.table)
set.seed(42)

nr <- 1e7
dt <- data.table(ID = rep(1:(nr/4), each = 4),
    country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
    weight = sample(10:100, nr, TRUE))
dt[sample(1:nr, nr/2), country_id := NA]
DT <- copy(dt)

microbenchmark::microbenchmark(
    first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1L], by = ID],
    use_join=DT[is.na(country_id), country_id := DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]],
    times = 1L
)

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373     1
         use_join  554.5314  554.5314  554.5314  554.5314  554.5314  554.5314     1

I hope the code below helps you fill in the NAs:

# df is assumed to hold the example data as a data.frame
res <- Reduce(rbind,
       lapply(split(df, df$ID), function(v) {
         # copy the first non-NA country_id of this ID to every row (NA if none exists)
         v$country_id <- v$country_id[!is.na(v$country_id)][1]
         v
       }))

  ID country_id weight
1  1        BGD     56
2  1        BGD     57
3  1        BGD     63
4  2         SA     12
5  2         SA     53
6  2         SA     54

Does this answer your question? Or dt[, country_id := country_id[!is.na(country_id)][1], by = ID] should work.

@Andrew Thanks!! This works!

@sindri_baldur That's fair, although the duplicate I flagged contains several data.table answers. I was trying to find a post that many others had linked to. Do you want to flag a more data.table-specific post? This is certainly a question that has already been covered.

You might want to follow this issue on nafill and setnafill for character columns.

Chinsoon, would you mind breaking down in detail what the join is doing here?