R: Replace all strings with numbers based on conditions
I have a column of data describing possible diseases, and I am trying to convert these qualitative values into quantitative ones. For example, I want to set conditions such as: if a row contains the words "blood pressure", remove all characters and replace them with 3; if a row contains "heart", replace with 2; if a row contains "diabetes" or "kidney disease", replace with 1; in any other case, replace with 0.5.
For example, my data looks like this:
Gene Condition
Gene1 Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker
Gene2 Name=blood pressure, Name=diabetes
Gene3 Name=heart disease
Gene4 Name=Childhood ear infection
Gene5 NA
Gene6 Name=kidney disease
Based on the conditions I mentioned, the output I am trying to achieve is:
Gene Condition
Gene1 0.5
Gene2 3
Gene3 2
Gene4 0.5
Gene5 NA
Gene6 1
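I am new to R, so I am not sure whether my approach is the best one. I have been trying to run my conditions so that they replace specific strings (rather than all characters), which produces several numbers in one row (mixed with strings) when more than one condition is met, and then to apply a getmax function to each row to get the largest available number. However, I am stuck on setting up the conditions that perform the string-to-number conversion.
This is what I have been trying: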
data$condition[data$condition == "blood pressure"] <- "3"
data$condition[data$condition == "heart disease"] <- "2"
data$condition[data$condition == "diabetes" | "kidney disease"] <- "1"
data$condition[data$condition == "Name" && !"diabetes" | "kidney disease" | "blood pressure" | "heart disease"] <- "0.5"
Use grepl to test whether each row contains a keyword.
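The attempt in the question compares whole cells with ==, which only matches when the entire cell equals the string, and applying | to character strings is an error in R. A minimal sketch of the grepl idea follows. The example data frame, the Score column and the helper name score_condition are assumptions made for illustration, not taken from the original answer. The checks run from the highest score to the lowest, so the first match is already the maximum for that row, and NA stays NA.

```r
# Example data mirroring the question (shortened; column names are assumptions)
data <- data.frame(
  Gene = paste0("Gene", 1:6),
  Condition = c(
    "Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia",
    "Name=blood pressure, Name=diabetes",
    "Name=heart disease",
    "Name=Childhood ear infection",
    NA,
    "Name=kidney disease"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: checks go from highest to lowest score,
# so a row matching several keywords gets the largest value.
score_condition <- function(x) {
  ifelse(is.na(x), NA_real_,
  ifelse(grepl("blood pressure", x, ignore.case = TRUE), 3,
  ifelse(grepl("heart", x, ignore.case = TRUE), 2,
  ifelse(grepl("diabetes|kidney disease", x, ignore.case = TRUE), 1, 0.5))))
}

data$Score <- score_condition(data$Condition)
print(data[, c("Gene", "Score")])
```

Because the priority is encoded in the order of the checks, no separate "take the maximum" step is needed with this layout.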
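The matching operation can be expressed as a complex join in SQL. First create numDF, a two-column data frame in which the names in the first column correspond to the numbers in the second. Then perform the join: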
library(sqldf)
nums <- c("blood pressure" = 3, heart = 2, diabetes = 1, "kidney disease" = 1)
numDF <- data.frame(Name = names(nums), Value = as.vector(nums))
sqldf("select
a.Gene,
max(case when a.Condition is not Null then coalesce(b.Value, 0.5) end) Condition
from DF a
left join numDF b on a.Condition like '%' || b.Name || '%'
group by Gene", method = "raw")
Note
dput cannot be used on objects that contain internal pointers, so I have modified the dput output to make it usable:
DF <-
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6"), Condition = c(" Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker",
" Name=blood pressure, Name=diabetes", "Name=heart disease",
"Name=Childhood ear infection", NA, "Name=kidney disease")),
row.names = c(NA, -6L), class = "data.frame")
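There are some unresolved issues in your data. For example, row 2 contains both blood pressure and diabetes, which have different values. Which one should be chosen in that case?
The highest value should be chosen. Apologies, I should have stated that I consider the highest value the most important, which is why I was trying to collect all the numbers in one cell and then apply getmax to each cell/row.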
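library(data.table)  # assumes `data` is a data.table with columns Gene and Condition
# One row per condition: split on commas, then drop the "Name=" prefix and stray spaces
res <- data[, list(Condition = unlist(strsplit(Condition, ","))), by = Gene
  ][, Condition := trimws(gsub("Name=", "", Condition))]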
res
# Gene Condition
# 1: Gene1 Asymmetrical dimethylarginine level
# 2: Gene1 Bipolar disorder and schizophrenia
# 3: Gene1 3-hydroxypropylmercapturic acid levels in smoker
# 4: Gene2 blood pressure
# 5: Gene2 diabetes
# 6: Gene3 heart disease
# 7: Gene4 Childhood ear infection
# 8: Gene5 <NA>
# 9: Gene6 kidney disease
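Once the data is in this long format, each condition can be scored on its own and the highest score kept per gene. The sketch below uses base R only. It rebuilds res as a plain data frame so that it runs by itself. The names patterns, score_one, best and out are hypothetical and not part of the original answer.

```r
# Long-format data as printed above, rebuilt as a plain data frame
res <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene1", "Gene2", "Gene2",
           "Gene3", "Gene4", "Gene5", "Gene6"),
  Condition = c("Asymmetrical dimethylarginine level",
                "Bipolar disorder and schizophrenia",
                "3-hydroxypropylmercapturic acid levels in smoker",
                "blood pressure", "diabetes", "heart disease",
                "Childhood ear infection", NA, "kidney disease"),
  stringsAsFactors = FALSE
)

# Keywords and scores taken from the question
patterns <- c("blood pressure" = 3, "heart" = 2,
              "diabetes" = 1, "kidney disease" = 1)

# Hypothetical helper: score a single condition string
score_one <- function(cond) {
  if (is.na(cond)) return(NA_real_)            # keep missing values missing
  hit <- vapply(names(patterns),
                function(p) grepl(p, cond, fixed = TRUE),
                logical(1))
  if (any(hit)) max(patterns[hit]) else 0.5    # 0.5 for any other condition
}

res$Score <- vapply(res$Condition, score_one, numeric(1), USE.NAMES = FALSE)

# Highest score per gene
best <- tapply(res$Score, res$Gene, max)
out <- data.frame(Gene = names(best), Condition = as.vector(best))
print(out)
```

Keeping the keywords in a named vector means a new disease category only needs one more entry in patterns rather than another nested condition.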
The sqldf query above gives:
Gene Condition
1 Gene1 0.5
2 Gene2 3.0
3 Gene3 2.0
4 Gene4 0.5
5 Gene5 NA
6 Gene6 1.0