R: Select pairwise cases so as to minimize missing data
I am trying to figure out how to subset a data frame based on the best combination of missing data. My data looks like this:
Country.Name X2010.x X2011.x X2012.x X2010.y X2011.y X2012.y
20 Belarus 15080 16410 16800 27.72 26.46 NA
21 Belgium 38810 40210 39870 NA NA NA
22 Belize 7720 7940 8170 NA NA NA
23 Benin 1590 1640 1710 NA NA 43.53
24 Bermuda 69340 66640 66390 NA NA NA
25 Bhutan 6140 6680 6960 NA NA 38.73
...............................................................
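Each year's .x column is paired with the same year's .y column.
If either .x or .y is missing for a year, I cannot select that pair.
In the end, what I need is a dataset without any NA. Which year is chosen for each country does not matter, but .x and .y must come from the same year.
Looking at the distribution of missing values across .x and .y, I can see that choosing X2011 would be the best combination: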
colSums(is.na(data))
Country.Name X2010.x X2011.x X2012.x X2010.y X2011.y X2012.y
0 3 3 3 21 19 22
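But I think this is the best combination overall, not for each specific country.
I just need to keep the maximum number of countries in the data.
How can I select a year for each country, based on that country's specific missing values, so that as many countries as possible are kept?
Is my question clear?
Any suggestions?
A non-optimal but possible result: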
Country.Name .x .y
20 Belarus 15080 27.72
31 Bulgaria 13950 35.78
35 Cambodia 2350 33.55
37 Canada 39200 33.68
45 China 9010 42.06
# Keep only the 2010 pair and drop rows with missing values
data = select(data, Country.Name, X2010.x, X2010.y)
data = na.omit(data)
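To see how much a single-year choice loses compared with letting each country use its own year, the complete pairs can be counted per year and per country. The following is a minimal base R sketch, not part of the original post; `toy` is a hypothetical four-row subset of the data above, and `years` and `complete` are illustrative names.

```r
# Toy subset of the question's data (four of the countries listed above)
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia"),
  X2010.x = c(15080, 38810, 1590, 4950),
  X2011.x = c(16410, 40210, 1640, 5200),
  X2012.x = c(16800, 39870, 1710, 5400),
  X2010.y = c(27.72, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26),
  X2012.y = c(NA, NA, 43.53, 46.64),
  stringsAsFactors = FALSE
)

years <- c("X2010", "X2011", "X2012")

# complete[i, j] is TRUE when country i has both .x and .y for year j
complete <- sapply(years, function(yr) {
  !is.na(toy[[paste0(yr, ".x")]]) & !is.na(toy[[paste0(yr, ".y")]])
})
rownames(complete) <- toy$Country.Name

# Countries kept if one fixed year is used for every country
colSums(complete)

# Countries kept if each country may use any of its complete years
sum(rowSums(complete) > 0)
```

On this toy subset, any single year keeps at most two countries, while a per-country choice keeps three (only Belgium, which has no .y value at all, is lost).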
Dataset:
data <- structure(list(Country.Name = c("Belarus", "Belgium", "Belize",
"Benin", "Bermuda", "Bhutan", "Bolivia", "Bosnia and Herzegovina",
"Botswana", "Brazil", "Brunei Darussalam", "Bulgaria", "Burkina Faso",
"Burundi", "Cabo Verde", "Cambodia", "Cameroon", "Canada", "Caribbean small states",
"Cayman Islands", "Central African Republic", "Central Europe and the Baltics",
"Chad", "Channel Islands", "Chile", "China"), X2010.x = c(15080,
38810, 7720, 1590, 69340, 6140, 4950, 8860, 12500, 13520, NA,
13950, 1390, 710, 5630, 2350, 2390, 39200, 13141.13583, NA, 880,
19213.13055, 1850, NA, 17010, 9010), X2011.x = c(16410, 40210,
7940, 1640, 66640, 6680, 5200, 9310, 13930, 14030, NA, 14790,
1430, 730, 5960, 2530, 2470, 40570, 12973.98051, NA, 910, 20391.27796,
1850, NA, 19040, 9940), X2012.x = c(16800, 39870, 8170, 1710,
66390, 6960, 5400, 9290, 14630, 14350, NA, 15250, 1550, 750,
6220, 2710, 2550, 41170, 13245.52928, NA, 950, 20765.62768, 1930,
NA, 20140, 10890), X2010.y = c(27.72, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 35.78, NA, NA, NA, 33.55, NA, 33.68, NA, NA,
NA, NA, NA, NA, NA, 42.06), X2011.y = c(26.46, NA, NA, NA, NA,
NA, 46.26, NA, NA, 53.09, NA, 34.28, NA, NA, NA, 31.82, NA, NA,
NA, NA, NA, NA, 43.3, NA, 50.84, NA), X2012.y = c(NA, NA, NA,
43.53, NA, 38.73, 46.64, NA, NA, 52.67, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Country.Name",
"X2010.x", "X2011.x", "X2012.x", "X2010.y", "X2011.y", "X2012.y"
), row.names = 20:45, class = "data.frame")
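Here is a dplyr and tidyr solution:
library(dplyr)
library(tidyr)

data %>%
  # Long format: one row per country and column, dropping missing values
  gather(YearXY, Value, -Country.Name, na.rm = TRUE) %>%
  # Split e.g. "X2010.x" into Year = "X2010" and XY = "x"
  separate(YearXY, c("Year", "XY")) %>%
  # One row per country and year, with columns x and y
  spread(XY, Value) %>% filter(!is.na(x) & !is.na(y)) %>%
  group_by(Country.Name) %>%
  # Keep the first complete year for each country
  slice(1)
Note that this drops countries that have no year with both x and y present.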
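To pick a random year instead, replace slice(1) with:
mutate(Random = sample(n())) %>%
filter(Random == 1) %>%
select(-Random)

Comments:
Please provide the desired output based on the example data. The expected output does not correspond to the example data. Perhaps library(data.table); melt(setDT(data), measure.vars = list(2:4, 5:7), na.rm = TRUE, value.name = c('x', 'y'))[, lapply(.SD, max), Country.Name, .SDcols = x:y]
In the description you mention that choosing 2011 gives the best overall combination. However, in the expected output you picked the 2010 value for Belarus, which is smaller than 16410.
A similar option with the devel version of data.table is melt(setDT(data), measure.vars = list(2:4, 5:7), value.name = c('x', 'y'))[!is.na(x) & !is.na(y)][, .SD[1L], Country.Name][, variable := NULL][]
@Nick I get this error message with the data from the dput output: Error in matrix(unlist(pieces), ncol = n, byrow = TRUE): 'data' must be of a vector type, was 'NULL'
@giacomoV I don't get any error using dplyr_0.4.1.9000.
@giacomoV Neither do I. I am using the latest versions of dplyr and tidyr. Are you sure your data is called data and is a data.frame?
@NickK My versions: dplyr_0.4.2 and tidyr_0.2.0. I just tried the edited data you suggested and still get the error. It comes from the line separate(YearXY, c("Year", "XY")).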
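The same reshaping idea can also be written with `pivot_longer()`, which in newer tidyr releases supersedes `gather()`, `separate()` and `spread()`. This is only a sketch and not the answerer's code: it assumes tidyr 1.0.0 or later, and `toy` is a hypothetical four-row subset of the question's data used to keep the example self-contained.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data (assumption: four countries only)
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia"),
  X2010.x = c(15080, 38810, 1590, 4950),
  X2011.x = c(16410, 40210, 1640, 5200),
  X2012.x = c(16800, 39870, 1710, 5400),
  X2010.y = c(27.72, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26),
  X2012.y = c(NA, NA, 43.53, 46.64),
  stringsAsFactors = FALSE
)

result <- toy %>%
  # "X2010.x" becomes Year = "X2010" plus a value column named "x"
  pivot_longer(
    cols = -Country.Name,
    names_to = c("Year", ".value"),
    names_sep = "\\."
  ) %>%
  # Keep only years where both members of the pair are present
  filter(!is.na(x), !is.na(y)) %>%
  group_by(Country.Name) %>%
  # First complete year per country
  slice(1) %>%
  ungroup()

result
```

As with the original pipeline, a country with no complete year (Belgium in this subset) does not appear in the result.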