How to interpolate/extrapolate within individual groups using the na.approx function in R


I have a panel dataset of 10 variables for 60 countries over 18 years (2000-2017), and a lot of the data is missing.

Country Year    Broadband

Albania 2000    NA
Albania 2001    NA
Albania 2002    NA
Albania 2003    NA
Albania 2004    NA
Albania 2005    272
Albania 2006    NA
Albania 2007    10000
Albania 2008    64000
Albania 2009    92000
Albania 2010    105539
Albania 2011    128210
Albania 2012    160088
Albania 2013    182556
Albania 2014    207931
Albania 2015    242870
Albania 2016    263874
Albania 2017    NA
Algeria 2000    NA
Algeria 2001    NA
Algeria 2002    NA
Algeria 2003    18000
Algeria 2004    36000

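I want to use the na.approx function in R to interpolate (and to extrapolate with rule = 2), but only within each country. For example, in this sample dataset I want to interpolate the value for Albania 2006 and extrapolate the values for Albania 2000-2004 and 2017. However, I want to make sure that the value for Albania 2017 is not interpolated from Albania 2016 and Algeria 2003. For Algeria 2000-2002, I want the values to be extrapolated from the Algeria 2003 and 2004 data. I tried the following code: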

data <- group_by(data, country)
data$broadband <- na.approx(data$broadband, maxgap = Inf, rule = 2)
data <- as.data.frame(data)

mylist <- split(data, data$country)

alb <- mylist[1]
alb <- as_data_frame(alb)
alg <- mylist[2]
alg <- as_data_frame(alg)
ang <- mylist[3]
ang <- as_data_frame(ang)

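As you can see, the imputed values for Angola 2000-2005 appear to have been computed from Algeria's values, because they are far higher than Angola's 2006 value of 7458.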

Edit 3: Here is the full code I am using:

data <- read_excel("~/Documents/data.xlsx")

> dput(head(data))
structure(list(continent = c("Europe", "Europe", "Europe", "Europe", 
"Europe", "Europe"), country = c("Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania"), Year = c(2000, 2001, 2002, 
2003, 2004, 2005), `Individuals Using Internet, %, WB` = c(0.114097347, 
0.325798377, 0.390081273, 0.971900415, 2.420387798, 6.043890864
), `Secure Internet Servers, WB` = c(NA, 1, NA, 1, 2, 1), 
`Mobile Cellular Subscriptions, WB` = c(29791, 
392650, 851000, 1100000, 1259590, 1530244), 
`Fixed Broadband Subscriptions, WB` = c(NA, 
NA, NA, NA, NA, 272), `Trade, % GDP, WB` = c(55.9204287230026, 
57.4303612453301, 63.9342407411882, 65.4406219482911, 66.3578254370479, 
70.2953012017195), `Air transport, freight (million ton-km)` = c(0.003, 
0.003, 0.144, 0.088, 0.099, 0.1), 
`Air Transport, registered carrier departures worldwide, WB` = c(3885, 
3974, 3762, 3800, 4104, 4309), 
`FDI, net, inflows, % GDP, WB` = c(3.93717707227928, 
5.10495722596557, 3.04391445388559, 3.09793068135411, 4.66563777108359, 
3.21722676118428), `Number of Airports, WFB` = c(10, 11, 11, 
11, 11, 11), `Currently under EU Arms Sanctions` = c(0, 0, 0, 
0, 0, 0), `Currently under EU Economic Sanctions` = c(0, 0, 0, 
0, 0, 0), `Currently under UN Arms Sanctions` = c(0, 0, 0, 0, 
0, 0), `Currently under UN Economic Sanctions` = c(0, 0, 0, 0, 
0, 0), `Currently under US Arms Embargo` = c(0, 0, 0, 0, 0, 0
), `Currently under US Economic Sanctions` = c(0, 0, 0, 0, 0, 
0)), .Names = c("continent", "country", "Year", 
"Individuals Using Internet, %, WB", 
"Secure Internet Servers, WB", "Mobile Cellular Subscriptions, WB", 
"Fixed Broadband Subscriptions, WB", "Trade, % GDP, WB", 
"Air transport, freight (million ton-km)", 
"Air Transport, registered carrier departures worldwide, WB", 
"FDI, net, inflows, % GDP, WB", "Number of Airports, WFB", 
"Currently under EU Arms Sanctions", 
"Currently under EU Economic Sanctions", "Currently under UN Arms Sanctions", 
"Currently under UN Economic Sanctions", "Currently under US Arms Embargo", 
"Currently under US Economic Sanctions"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

data_imputed <- data %>% 
group_by(country) %>% 
mutate(broadband_imp = na.approx(broadband, maxgap = Inf, rule = 2))


You can use group_by and mutate:

library(tidyverse)
library(zoo)

df_imputed <- df %>% 
group_by(Country) %>% 
mutate(Broadband_imputed = na.approx(Broadband, maxgap = Inf, rule = 2))


which gives

> head(df_imputed)
# A tibble: 6 x 4
# Groups:   Country [1]
  Country  Year Broadband Broadband_imputed
   <fctr> <int>     <int>             <dbl>
1 Albania  2000        NA               272
2 Albania  2001        NA               272
3 Albania  2002        NA               272
4 Albania  2003        NA               272
5 Albania  2004        NA               272
6 Albania  2005       272               272


and

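> df_imputed %>% filter(Country == 'Algeria')
# A tibble: 5 x 4
# Groups:   Country [1]
  Country  Year Broadband Broadband_imputed
   <fctr> <int>     <int>             <dbl>
1 Algeria  2000        NA             18000
2 Algeria  2001        NA             18000
3 Algeria  2002        NA             18000
4 Algeria  2003     18000             18000
5 Algeria  2004     36000             36000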
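The repeated 272 and 18000 in the leading rows of the output above follow from how rule = 2 behaves. The argument is passed on to stats::approx, which, for points outside the observed range, returns the value at the closest data extreme rather than extending the trend. A minimal sketch on a made-up vector (the values and the names `x`, `filled` and `kept` are illustrative only and do not come from the dataset):

```r
library(zoo)

# Toy vector, not taken from the question's data
x <- c(NA, NA, 5, NA, 9, NA)

# rule = 2: the interior gap is interpolated linearly; leading and trailing
# NAs are filled with the nearest observed value (constant extrapolation)
filled <- na.approx(x, rule = 2)
print(filled)  # 5 5 5 7 9 9

# Default rule (1) with na.rm = FALSE: the outer NAs are left as NA
kept <- na.approx(x, na.rm = FALSE)
print(kept)    # NA NA 5 7 9 NA
```

So with rule = 2 the first observed value of each group is carried backward and the last one forward; no slope is estimated for the outer gaps.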

Data

df <- read.table(text = "Country Year    Broadband
Albania 2000    NA
Albania 2001    NA
Albania 2002    NA
Albania 2003    NA
Albania 2004    NA
Albania 2005    272
Albania 2006    NA
Albania 2007    10000
Albania 2008    64000
Albania 2009    92000
Albania 2010    105539
Albania 2011    128210
Albania 2012    160088
Albania 2013    182556
Albania 2014    207931
Albania 2015    242870
Albania 2016    263874
Albania 2017    NA
Algeria 2000    NA
Algeria 2001    NA
Algeria 2002    NA
Algeria 2003    18000
Algeria 2004    36000", header = TRUE)
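To see why the grouping matters, the sketch below contrasts a call on the whole column with a grouped call. It uses a cut-down version of the data above (only the rows around the country boundary are kept), and the names `toy`, `ungrouped` and `grouped` are made up for illustration. On the whole column, Albania 2017 sits between Albania 2016 and Algeria 2003, which is the leak the question wants to avoid.

```r
library(dplyr)
library(zoo)

# Cut-down illustration: last rows of Albania, first rows of Algeria
toy <- data.frame(
  Country   = c(rep("Albania", 3), rep("Algeria", 3)),
  Year      = c(2015, 2016, 2017, 2002, 2003, 2004),
  Broadband = c(242870, 263874, NA, NA, 18000, 36000)
)

# Whole column at once: country boundaries are ignored, so the two NAs are
# interpolated between Albania 2016 and Algeria 2003
ungrouped <- na.approx(toy$Broadband, rule = 2)

# Grouped: na.approx() runs once per country inside mutate()
grouped <- toy %>%
  group_by(Country) %>%
  dplyr::mutate(Broadband_imputed = na.approx(Broadband, rule = 2)) %>%
  ungroup()

print(ungrouped)                  # 242870 263874 181916 99958 18000 36000
print(grouped$Broadband_imputed)  # 242870 263874 263874 18000 18000 36000
```

If the grouped column ever looks like the first result, the grouping is probably being dropped somewhere before na.approx is called.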

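Unfortunately, I don't think this worked properly. I think the problem arises when some countries have data for 2004-2017 but none for 2000-2003, so no interpolated value can be computed between values within the same country. The function is therefore forced to interpolate using values from the neighbouring country. I'm not sure this is right, since I'm no expert on how na.approx works, but my understanding is that it is designed mainly for interpolation, with an option for extrapolation. Is there a function specifically for extrapolation?

The function carries each country's first observation backward, no?

I have edited the question to answer you in more depth. When I include Angola, I get a different output.

Could you include the output of dput(head()) in your question? (With the argument replaced by the name of your data frame, of course.)

Did you solve the problem?

Strangely, the output of dput does not give what it should. Some ) are missing or in the wrong place. names contains only four entries (and one with a missing character), although there are seven variables in the data. If you send me your whole dataset, I'll see what I can do. Best, Markus

Hi. My dataset is large, so I really don't know how to send it! I will edit the question so that you can see the full code of what I have done so far, again using dput. Thank you so much for your help!

I'm running out of ideas, but this might be a package conflict. Replace mutate in your code with dplyr::mutate. If that works, I'll ...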
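The package-conflict idea can be checked directly. One common cause (an assumption here, since the thread does not confirm it) is that another attached package, for example plyr, exports its own mutate that does not respect group_by, so a bare mutate silently works on the whole column. A hedged sketch that reports where mutate resolves from and namespaces every verb; the data frame `toy_df` and its values are hypothetical:

```r
library(dplyr)
library(zoo)

# Which package does a bare `mutate` resolve to in this session?
# In a session with only dplyr and zoo attached this is "dplyr".
mutate_source <- environmentName(environment(mutate))
print(mutate_source)

# Hypothetical two-group example
toy_df <- data.frame(
  Country   = c("A", "A", "A", "B", "B", "B"),
  Broadband = c(NA, 10, 20, NA, 100, 200)
)

# Explicit namespaces, so the result does not depend on attach order
imputed <- toy_df %>%
  dplyr::group_by(Country) %>%
  dplyr::mutate(Broadband_imputed = zoo::na.approx(Broadband, maxgap = Inf, rule = 2)) %>%
  dplyr::ungroup()

print(imputed$Broadband_imputed)  # 10 10 20 100 100 200
```

If mutate_source reports a package other than dplyr, the namespaced call is the safer form to keep in the script.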
