Conditionally filling in missing values when reshaping a dataset from long to wide in R
I am building a complete time series of an indicator for a range of years and countries, based on several datasets of differing quality. Using reshape2, I have "melted" these datasets into a single data frame.
Example dataset:
d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE",
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L,
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L,
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5,
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA,
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"),
source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council",
"Eurostat", "OECD"), class = "factor")), .Names = c("cntry",
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA,
-19L))
d
# cntry year indicator sex source
# 1 BE 1960 5.5 male Eurostat
# 2 BE 1970 1.2 male Eurostat
# 3 BE 1980 1.5 male Eurostat
# 4 DE 1960 NA male Eurostat
# 5 DE 1970 1.4 male Eurostat
# 6 GE 1960 NA male Eurostat
# 7 GE 1970 NA male Eurostat
# 8 BE 1960 5.5 male OECD
# 9 BE 1970 1.2 male OECD
# 10 DE 1960 2.3 male OECD
# 11 DE 1970 1.4 male OECD
# 12 GE 1960 NA male OECD
# 13 GE 1970 1.4 male OECD
# 14 BE 1960 NA male Council
# 15 BE 1970 NA male Council
# 16 DE 1960 2.3 male Council
# 17 DE 1970 1.4 male Council
# 18 GE 1970 1.4 male Council
# 19 GE 1980 NA male Council
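What I would like is a single value per country and year, taken from the preferred source (Eurostat first, then OECD, then Council) whenever that source has one, and optionally (or directly) converted to wide format: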
# cntry sex 1960 1970 1980
# BE male 5.5 1.2 1.5
# DE male 2.3 1.4 NA
# GE male NA 1.4 NA
I'm not sure this covers everything you expect, but it sounds like you are looking for something like the following:
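# Build every cntry/year/source combination so that absent rows show up as NA
toMerge <- expand.grid(cntry = c("BE", "DE", "GE"),
                       year = c(1960, 1970, 1980),
                       source = c("Eurostat", "OECD", "Council"),
                       sex = "male")
d2 <- merge(d, toMerge, all = TRUE)
# Order the sources so that the preferred one (Eurostat) comes first after sorting
d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
# Rank the values within each cntry/year/sex group; ties go to the earlier
# (preferred) source, and NAs are ranked last
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")],
                     FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D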
# cntry year sex source indicator
# 2 BE 1960 male Eurostat 5.5
# 5 BE 1970 male Eurostat 1.2
# 8 BE 1980 male Eurostat 1.5
# 14 DE 1970 male Eurostat 1.4
# 17 DE 1980 male Eurostat NA
# 20 GE 1960 male Eurostat NA
# 26 GE 1980 male Eurostat NA
# 12 DE 1960 male OECD 2.3
# 24 GE 1970 male OECD 1.4
library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
# cntry 1960 1970 1980
# 1 BE 5.5 1.2 1.5
# 2 DE 2.3 1.4 NA
# 3 GE NA 1.4 NA
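One caveat worth spelling out: because `rank()` is applied to the indicator values, the approach above keeps the smallest non-missing value in each group. That matches the preferred source only while the sources agree, as they do in the sample data. A minimal base R sketch of a variant that follows the source order even when values differ might look like this; the toy values and the helper name `first_non_na` are assumptions made up for illustration, not part of the original data.

```r
# Toy data (hypothetical values): the three sources disagree for BE 1960
toy <- data.frame(
  cntry     = c("BE", "BE", "BE", "DE", "DE", "DE"),
  year      = c(1960, 1960, 1960, 1960, 1960, 1960),
  source    = c("Council", "OECD", "Eurostat", "Council", "OECD", "Eurostat"),
  indicator = c(1.0, 2.0, 9.9, NA, 2.3, NA),
  stringsAsFactors = FALSE
)

# Preferred order of sources, most trusted first
priority <- c("Eurostat", "OECD", "Council")

# Sort rows so that, within each country-year, preferred sources come first
toy <- toy[order(toy$cntry, toy$year, match(toy$source, priority)), ]

# Hypothetical helper: first non-NA element of a vector, or NA if there is none
first_non_na <- function(x) {
  ok <- x[!is.na(x)]
  if (length(ok)) ok[1] else NA_real_
}

# na.pass keeps the NA rows so that the helper sees every source
picked <- aggregate(indicator ~ cntry + year, data = toy,
                    FUN = first_non_na, na.action = na.pass)
picked
```

Here BE 1960 resolves to the Eurostat value (9.9) rather than the smallest one, and DE 1960 falls back to OECD because Eurostat is missing.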
Here is another option:
library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
d2, cntry + sex ~ year, value.var="indicator",
fun.aggregate=function(x) {
if(!length(x)) return(NA_real_)
xs <- do.call(rbind, x)
xs <- xs[complete.cases(xs), ]
if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )
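Note: I added 100 to the "Eurostat" values to tell them apart from the others, since in this sample the values appear to be equal across sources.
Basically, we cheat by converting the indicator column into a list column whose items hold both the indicator and the source, and then use fun.aggregate to pick, from each group, the item with the lowest source value (note that we reset the factor so that the most desirable source has the lowest level).
Perhaps the following would also work: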
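library(reshape2)
# Long format with a single "value" column, then one column per source
x <- melt(d, id.vars=c("cntry","year","source","sex"))
y <- dcast(x, cntry+year+sex ~ source, value.var="value")
# Take Eurostat if present, otherwise OECD, otherwise Council
y$selected.value <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes=y$Council,no=y$OECD),no=y$Eurostat)
# Name the value column explicitly instead of letting dcast guess it
dcast(y, cntry + sex ~ year, value.var="selected.value")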
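The nested `ifelse` above gets harder to read as more sources are added. As a sketch of one way to generalise it, the same fallback rule can be folded over a priority vector with `Reduce`. The toy table `y_toy` (shaped like `y`, with made-up values) and the helper name `fill_from` are assumptions for illustration only.

```r
# Wide-by-source table shaped like `y` above (hypothetical toy values)
y_toy <- data.frame(
  cntry    = c("BE", "DE", "GE"),
  year     = c(1960, 1960, 1960),
  Eurostat = c(5.5, NA, NA),
  OECD     = c(5.5, 2.3, NA),
  Council  = c(NA, 2.3, NA)
)

# Source columns, most preferred first
priority <- c("Eurostat", "OECD", "Council")

# Hypothetical helper: keep the current value, fall back where it is NA
fill_from <- function(current, fallback) ifelse(is.na(current), fallback, current)

# Fold the helper over the source columns in priority order
y_toy$selected.value <- Reduce(fill_from, y_toy[priority])
y_toy
```

Adding a fourth source then only means extending `priority`, provided the corresponding column exists in the table.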
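Assuming the data are already in the order you require, i.e. the source column is sorted with Eurostat first, then OECD, then Council, I would go with data.table as follows: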
require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator",
subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))
# cntry sex 1960 1970 1980
# 1: BE male 5.5 1.2 1.5
# 2: DE male 2.3 1.4 NA
# 3: GE male NA 1.4 NA
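The call above relies on the rows already being in the preferred order and on the sources reporting equal values. As a hedged sketch of a variant that makes the priority explicit, the sources can be turned into a factor, the rows sorted by it, and the first non-missing value kept per cell before casting. The toy values are made up, and `dcast()` is used in place of `dcast.data.table()`, which assumes a reasonably recent data.table.

```r
library(data.table)

# Toy data (hypothetical values): BE has conflicting values, GE has none
toy <- data.table(
  cntry     = c("BE", "BE", "DE", "DE", "GE", "GE"),
  year      = c(1960L, 1960L, 1960L, 1960L, 1970L, 1970L),
  sex       = "male",
  source    = c("Council", "Eurostat", "Eurostat", "OECD", "OECD", "Council"),
  indicator = c(1.0, 9.9, NA, 2.3, NA, NA)
)

# Encode the preference as factor levels, most trusted first
priority <- c("Eurostat", "OECD", "Council")
toy[, source := factor(source, levels = priority)]

# Sort by preference, then keep the first non-NA value per country/sex/year
# (indexing an empty vector with [1L] yields NA, so all-NA cells are kept)
best <- toy[order(source),
            .(indicator = indicator[!is.na(indicator)][1L]),
            by = .(cntry, sex, year)]

# One row per cell remains, so no aggregation function is needed
wide <- dcast(best, cntry + sex ~ year, value.var = "indicator")
wide
```

Keeping the all-NA cells in `best` means that countries with no usable value (GE here) still appear as a row of NAs in the wide table.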
A concise solution.
To also record which source each selected value came from, the same nested ifelse can be applied to the source names (y is the data frame built in the reshape2 answer above):
y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")