Creating variables from row contents in R
I have hospital visit data containing records of gender, age, primary diagnosis, and hospital identifier. I want to create separate variables for these entries. The data follow a pattern: most observations start with a gender code (M or F), followed by the age, then the diagnosis, and finally the hospital identifier. There are some exceptions, though. In some cases the gender is coded as 01 or 02, and in those cases the gender identifier appears at the end. I searched the archives and found some examples using grep, but I have not managed to apply them effectively to my data. For example, I tried code such as:
ndiag<-dat[grep("copd", dat[,1], fixed = TRUE),]
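This problem seems to have two key parts: cleaning up the inconsistent gender coding, then splitting each cleaned string into its fields. Vectorized string functions such as `str_locate` and `substr` do most of the work.
Part 1 - Cleaning the strings for the m/f // 01/02 coding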
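Please include sample data that can be read into R; `dput` is a convenient function for printing it from R. Are there any records that neither start nor end with a gender code?
Thank you for spending so much of your valuable time writing such great code. The script did exactly what I wanted. Many thanks.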
diagnosis      hospital  diag   age    gender
m3034CVDA      A         cvd    30-34  M
m3034cardvA    A         cardv  30-34  M
f3034aceB      B         ace    30-34  F
m3034hfC       C         hf     30-34  M
m3034cereC     C         cere   30-34  M
m3034resPC     C         resp   30-34  M
3034copd_Z_01  Z         copd   30-34  M
3034copd_Z_01  Z         copd   30-34  M
fcereZ         Z         cere   NA     F
f3034respC     C         resp   30-34  F
3034copd_Z_02  Z         copd   30-34  F
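The code that follows expects a character vector named `diagnosis`, which is never defined in the thread. As a minimal sketch, assuming the first column of the table above is the raw input, it can be rebuilt and shared like this (the name `restored` is only illustrative):

```r
# Sample input rebuilt from the first column of the table above (an assumption:
# the thread never shows the raw data in a form that can be pasted into R).
diagnosis <- c("m3034CVDA", "m3034cardvA", "f3034aceB", "m3034hfC",
               "m3034cereC", "m3034resPC", "3034copd_Z_01", "3034copd_Z_01",
               "fcereZ", "f3034respC", "3034copd_Z_02")

# dput() prints an R expression that recreates the object when pasted back in,
# which makes it a convenient way to post sample data.
dput(diagnosis)

# Round trip: evaluating the printed expression gives back the same vector.
restored <- eval(parse(text = capture.output(dput(diagnosis))))
identical(restored, diagnosis)
```

Pasting the `dput()` output into a question lets others reproduce the data exactly, including the irregular `_01`/`_02` rows.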
# We will be using this library later for str_detect, str_replace, etc
library(stringr)
# first, make sure diagnosis is character (strings) and not factor (category)
diagnosis <- as.character(diagnosis)
# We will use a temporary vector, to preserve the original, but this is not a necessary step.
diagnosisTmp <- diagnosis
males <- str_locate(diagnosisTmp, "_01")
females <- str_locate(diagnosisTmp, "_02")
# NOTE: All of this works only as long as '_01'/'_02' appears solely as the trailing gender code.
# The next two lines check for errors, making sure we didn't accidentally grab a "_01"/"_02" from the middle of the string
#-------------------------
if (any(str_length(diagnosisTmp) != males[,2], na.rm=TRUE)) stop("Error in coding for males")
if (any(str_length(diagnosisTmp) != females[,2], na.rm=TRUE)) stop("Error in coding for females")
#------------------------
# remove all the '_01'/'_02' (replacing with "")
diagnosisTmp <- str_replace(diagnosisTmp, "_01", "")
diagnosisTmp <- str_replace(diagnosisTmp, "_02", "")
# append to front of string appropriate m/f code
diagnosisTmp[!is.na(males[,1])] <- paste0("m", diagnosisTmp[!is.na(males[,1])])
diagnosisTmp[!is.na(females[,1])] <- paste0("f", diagnosisTmp[!is.na(females[,1])])
# remove superfluous underscores (all of them, not just the first one in each string)
diagnosisTmp <- str_replace_all(diagnosisTmp, "_", "")
# display the original next to modified, for visual spot check
cbind(diagnosis, diagnosisTmp)
# gender is the first char, hospital is the last.
gender <- toupper(str_sub(diagnosisTmp, 1,1))
hosp <- str_sub(diagnosisTmp, -1,-1)
# age, if present, is chars 2-5. Keep it as text so leading zeros (e.g. "0509") survive; anything that is not four digits becomes NA
ageRaw <- str_sub(diagnosisTmp, 2, 5)
age <- ifelse(str_detect(ageRaw, "^[0-9]{4}$"), paste(str_sub(ageRaw, 1, 2), str_sub(ageRaw, 3, 4), sep="-"), NA)
# diagnosis is variable length, so we have to find where to start
diagStart <- 2 + 4*(!is.na(age))
diag <- str_sub(diagnosisTmp, diagStart, -2)
# Put it all together into a data frame
dat <- data.frame(diagnosis, hosp, diag, age, gender)
## OR WITHOUT ORIGINAL DIAGNOSIS STRING ##
dat <- data.frame(hosp, diag, age, gender)
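For reuse, the same two parts can be wrapped in one function. The following is only a sketch in base R, not the answer's own method: the function name `split_visits` is hypothetical, and lowercasing the diagnosis is an assumption made to match the table shown earlier.

```r
# Hypothetical helper: split raw visit strings into hospital, diagnosis,
# age band and gender. Assumes the layout described in the question:
# [m|f][4-digit age band, optional][diagnosis][1-character hospital],
# or the same without the leading letter and with a trailing _01/_02.
split_visits <- function(x) {
  x <- as.character(x)
  # Part 1: move a trailing _01 / _02 gender code to the front as m / f
  x <- sub("^(.*)_01$", "m\\1", x)
  x <- sub("^(.*)_02$", "f\\1", x)
  # drop any remaining underscores
  x <- gsub("_", "", x, fixed = TRUE)

  # Part 2: slice the cleaned string by position
  has_age <- grepl("^.[0-9]{4}", x)  # four digits right after the gender letter
  n <- nchar(x)
  data.frame(
    hosp   = substr(x, n, n),
    # lowercasing is an assumption, chosen to match the table in the question
    diag   = tolower(substr(x, ifelse(has_age, 6, 2), n - 1)),
    age    = ifelse(has_age,
                    paste(substr(x, 2, 3), substr(x, 4, 5), sep = "-"),
                    NA_character_),
    gender = toupper(substr(x, 1, 1)),
    stringsAsFactors = FALSE
  )
}

res <- split_visits(c("m3034CVDA", "3034copd_Z_02", "fcereZ"))
print(res)
```

Because every step is vectorized, the function can be applied to the whole column at once, for example `split_visits(dat[, 1])`.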