R: using gather() to tidy data with two column headers
I would like to tidy some data that unfortunately has two column headers set up in its first two rows:
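- First row (header): is actually the type of measure (e.g. Estimate, Standard Error, Upper bound, Lower bound)
- Second row (also a header): is the year of measurement
Is there some way to use gather() or any other method to tidy this data?
In addition, when a measure name is repeated (e.g. Rank, Rank.1), it should really just read Rank; the repeated columns differ only by year. Is there a way to fix this?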
  Country_Territory WBCode Estimate  StdErr NumSrc    Rank   Lower
1              Year   <NA>  1996.00 1996.00   1996 1996.00 1996.00
2           Andorra    ADO     1.32    0.48      1   87.10   72.04
3       Afghanistan    AFG    -1.29    0.34      2    4.30    0.00
4            Angola    AGO    -1.17    0.26      4    9.68    0.54
    Upper Estimate.1 StdErr.1 NumSrc.1  Rank.1 Lower.1 Upper.1
1 1996.00    1998.00  1998.00     1998 1998.00 1998.00 1998.00
2   96.77       1.38     0.46        1   89.18   74.74   96.91
3   27.42      -1.18     0.33        2    9.79    0.00   31.44
4   27.42      -1.41     0.21        6    1.55    0.00   13.40
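To make the problem reproducible, the printed sample can be rebuilt as a data frame. This is only a sketch: the object name df matches the answers in this thread, but the column types are an assumption (the printout does not show them), and only the four rows shown are included.

```r
# Assumed reconstruction of the sample shown above.
# Row 1 is the second header: it holds the measurement year for every value column.
df <- data.frame(
  Country_Territory = c("Year", "Andorra", "Afghanistan", "Angola"),
  WBCode     = c(NA, "ADO", "AFG", "AGO"),
  Estimate   = c(1996, 1.32, -1.29, -1.17),
  StdErr     = c(1996, 0.48, 0.34, 0.26),
  NumSrc     = c(1996, 1, 2, 4),
  Rank       = c(1996, 87.10, 4.30, 9.68),
  Lower      = c(1996, 72.04, 0.00, 0.54),
  Upper      = c(1996, 96.77, 27.42, 27.42),
  Estimate.1 = c(1998, 1.38, -1.18, -1.41),
  StdErr.1   = c(1998, 0.46, 0.33, 0.21),
  NumSrc.1   = c(1998, 1, 2, 6),
  Rank.1     = c(1998, 89.18, 9.79, 1.55),
  Lower.1    = c(1998, 74.74, 0.00, 0.00),
  Upper.1    = c(1998, 96.91, 31.44, 13.40),
  stringsAsFactors = FALSE
)

# Inspect the structure: 4 rows, 2 id columns, 12 value columns
str(df)
```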
My attempt with gather() does not give me what I want:
  Country_Territory WBCode  measure  number
1              Year   <NA> Estimate 1996.00
2           Andorra    ADO Estimate    1.32
3       Afghanistan    AFG Estimate   -1.29
4            Angola    AGO Estimate   -1.17
5              Year   <NA>   StdErr 1996.00
6           Andorra    ADO   StdErr    0.48
because the Year row ends up mixed in with the countries/territories.
Here is one option:
library(tidyverse)
# get unique Year values and create column names (to add later)
df %>%
  filter(Country_Territory == "Year") %>%
  gather() %>%
  filter(value != "Year" & !is.na(value)) %>%
  pull(value) %>%
  unique() %>%
  paste0("Year_", .) -> col_years
# reshape data (excluding the Year row)
df %>%
  filter(Country_Territory != "Year") %>%
  gather(key, y, -Country_Territory, -WBCode) %>%
  separate(key, c("measure", "v")) %>%
  group_by(v = ifelse(is.na(v), 0, v)) %>%
  nest() -> df_info
reduce(df_info$data, function(x, y) left_join(x, y, by = c("Country_Territory", "WBCode", "measure"))) %>%
  setNames(c("Country_Territory", "WBCode", "measure", col_years))
# # A tibble: 18 x 5
# Country_Territory WBCode measure Year_1996 Year_1998
# <chr> <chr> <chr> <dbl> <dbl>
# 1 Andorra ADO Estimate 1.32 1.38
# 2 Afghanistan AFG Estimate -1.29 -1.18
# 3 Angola AGO Estimate -1.17 -1.41
# 4 Andorra ADO StdErr 0.48 0.46
# 5 Afghanistan AFG StdErr 0.34 0.33
# 6 Angola AGO StdErr 0.26 0.21
# 7 Andorra ADO NumSrc 1 1
# 8 Afghanistan AFG NumSrc 2 2
# 9 Angola AGO NumSrc 4 6
# 10 Andorra ADO Rank 87.1 89.2
# 11 Afghanistan AFG Rank 4.3 9.79
# 12 Angola AGO Rank 9.68 1.55
# 13 Andorra ADO Lower 72.0 74.7
# 14 Afghanistan AFG Lower 0 0
# 15 Angola AGO Lower 0.54 0
# 16 Andorra ADO Upper 96.8 96.9
# 17 Afghanistan AFG Upper 27.4 31.4
# 18 Angola AGO Upper 27.4 13.4
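The separate() step in the pipeline above splits each gathered column name at the dot, so Rank.1 becomes the measure Rank plus the suffix 1, while names without a suffix get NA in the second part (which the group_by() call then replaces with 0). The following is a minimal base R sketch of that idea only; it uses strsplit() as a stand-in for tidyr's separate(), and the keys vector is a made-up example rather than the full set of column names.

```r
# Hypothetical subset of the gathered column names
keys <- c("Estimate", "Rank", "Estimate.1", "Rank.1")

# Split each name at the literal dot
parts <- strsplit(keys, ".", fixed = TRUE)

# First piece is the measure; second piece is the column-set suffix (NA if absent)
measure <- vapply(parts, `[`, character(1), 1)
v       <- vapply(parts, `[`, character(1), 2)

# Names without a suffix belong to the first column set, labelled "0" here
v[is.na(v)] <- "0"

data.frame(key = keys, measure = measure, v = v)
```

Grouping by v then yields one block of rows per column set, i.e. one block per year, which is what the nest() and left_join() steps rely on.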
Or perhaps (if you only have two sets of measures):
data.table solution
It needs a little preparation (setting colnames and creating a table of unique names), but it is very fast.
This solution also works with more than two years.
library(data.table)
dt <- as.data.table(df)  # or use setDT(df)
# extract the unique years from the first row, from the third column until the end of dt
dt.years <- as.data.table(unique(t(dt[1, 3:ncol(dt)])))
dt.years[, year_id := 1:.N]
setnames(dt.years, c("year", "year_id"))
# melt rows 2:n of the data.table
dt.melt <- melt(dt[2:nrow(dt)],
                id.vars = c("Country_Territory", "WBCode"),
                measure.vars = patterns("Estimate", "StdErr", "NumSrc", "Rank", "Lower", "Upper"),
                value.name = c("Estimate", "StdErr", "NumSrc", "Rank", "Lower", "Upper"),
                variable.name = "year")
# melt() returns the column-set index as a factor; convert it to integer so it can be joined to year_id
dt.melt[, year := as.integer(year)]
# left join both data.tables
result <- dt.years[dt.melt, on = c(year_id = "year")]
# cleaning: drop the helper column
result[, year_id := NULL]
#    year Country_Territory WBCode Estimate StdErr NumSrc  Rank Lower Upper
# 1: 1996           Andorra    ADO     1.32   0.48      1 87.10 72.04 96.77
# 2: 1996       Afghanistan    AFG    -1.29   0.34      2  4.30  0.00 27.42
# 3: 1996            Angola    AGO    -1.17   0.26      4  9.68  0.54 27.42
# 4: 1998           Andorra    ADO     1.38   0.46      1 89.18 74.74 96.91
# 5: 1998       Afghanistan    AFG    -1.18   0.33      2  9.79  0.00 31.44
# 6: 1998            Angola    AGO    -1.41   0.21      6  1.55  0.00 13.40
The data.table method of melt() can reshape multiple measure columns simultaneously. With the patterns() function, there is no need to rename the columns.
library(data.table)
# reshape multiple measure columns simultaneously from wide to long format
cols <- c("Estimate", "StdErr", "NumSrc", "Rank", "Lower", "Upper")
long <- melt(setDT(df), measure.vars = patterns(cols), value.name = cols)
# extract years
yrs <- long[Country_Territory == "Year", .(variable, Year = as.integer(Estimate))]
# join to get a separate Year column, remove Year rows and helper column
result <- yrs[long[Country_Territory != "Year"], on = "variable"][, variable := NULL][]
result
   Year Country_Territory WBCode Estimate StdErr NumSrc  Rank Lower Upper
1: 1996           Andorra    ADO     1.32   0.48      1 87.10 72.04 96.77
2: 1996       Afghanistan    AFG    -1.29   0.34      2  4.30  0.00 27.42
3: 1996            Angola    AGO    -1.17   0.26      4  9.68  0.54 27.42
4: 1998           Andorra    ADO     1.38   0.46      1 89.18 74.74 96.91
5: 1998       Afghanistan    AFG    -1.18   0.33      2  9.79  0.00 31.44
6: 1998            Angola    AGO    -1.41   0.21      6  1.55  0.00 13.40
After reshaping, the variable column indicates which rows belong to one subset of columns in the wide format, i.e., to one particular year.
As @Uwe mentioned, you do not need to rename the columns first. I have removed the (superfluous) part from my answer.
It works absolutely perfectly on the full data set! I was wondering what this part of your code does? It looks like you are separating "measure", but there aren't actually two parts to separate: separate(key, c("measure", "v")) %>% group_by(v = ifelse(is.na(v), 0, v)) %>%
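For comparison, the same idea (use the Year row as a lookup for each value column, then stack one block of columns per year) can be sketched in base R without any packages. This is not one of the answers from the thread, just an illustration under stated assumptions: the toy df below is a hypothetical two-measure subset of the sample, and the helper names (id_cols, value_cols, blocks, long) are made up for this example.

```r
# Hypothetical two-measure subset of the sample data
df <- data.frame(
  Country_Territory = c("Year", "Andorra", "Angola"),
  WBCode     = c(NA, "ADO", "AGO"),
  Estimate   = c(1996, 1.32, -1.17),
  Rank       = c(1996, 87.10, 9.68),
  Estimate.1 = c(1998, 1.38, -1.41),
  Rank.1     = c(1998, 89.18, 1.55),
  stringsAsFactors = FALSE
)

id_cols    <- c("Country_Territory", "WBCode")
value_cols <- setdiff(names(df), id_cols)

# Split the second header (the Year row) from the actual observations
header <- df[df$Country_Territory == "Year", value_cols]
body   <- df[df$Country_Territory != "Year", ]

# Measure name without the ".1"-style suffix, and the year of each value column
measure <- sub("\\.[0-9]+$", "", value_cols)
year    <- as.integer(unlist(header))

# Build one block of rows per year, with the suffix-free measure names
blocks <- lapply(unique(year), function(yr) {
  cols <- value_cols[year == yr]
  out  <- body[, c(id_cols, cols)]
  names(out) <- c(id_cols, measure[year == yr])
  cbind(year = yr, out)
})

# Stack the blocks into the long result
long <- do.call(rbind, blocks)
rownames(long) <- NULL
long
```

Because the year is read from the header row for each column rather than inferred from the suffix, this sketch does not depend on the order of the column sets, but it does assume every year carries the same set of measures.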