R: Replace all strings with numbers based on conditions

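I have a column of data describing possible diseases, and I am trying to turn these qualitative values into quantitative ones. For example, I want to set conditions such as: "if a row contains the words 'blood pressure', remove all characters and replace them with 3; if a row contains 'heart', replace with 2; if a row contains 'diabetes' or 'kidney disease', replace with 1; for any other condition, replace with 0.5".

For example, my data looks like this: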

Gene     Condition
Gene1    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker
Gene2    Name=blood pressure, Name=diabetes
Gene3    Name=heart disease
Gene4    Name=Childhood ear infection
Gene5    NA
Gene6    Name=kidney disease
Given the conditions I mentioned, the output I am trying to get is:

Gene Condition
Gene1    0.5
Gene2    3
Gene3    2
Gene4    0.5
Gene5    NA
Gene6    1
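I am new to R, so I am not sure whether my approach is the best one. My idea was to run my conditions so that they replace only the specific strings (not all characters), which leaves several numbers (mixed with strings) in a row when more than one condition is met, and then to apply a getmax function to each row to get the largest number available. However, I am stuck on setting up the conditions that perform the string-to-number conversion. This is what I have been trying: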

data$condition[data$condition == "blood pressure"] <- "3"
data$condition[data$condition == "heart disease"] <- "2"
data$condition[data$condition == "diabetes" | "kidney disease"] <- "1"
data$condition[data$condition == "Name" && !"diabetes" | "kidney disease" | "blood pressure" | "heart disease"] <- "0.5"
Using grepl:
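The code that originally followed this line is not preserved here, so what follows is only a minimal sketch of one grepl-based approach in base R, not the answerer's actual code. It assumes the sample data from the question; the helper name score_condition and the Score column are hypothetical. Because the patterns are tested from the highest value to the lowest, the first match is already the maximum for that row, so no separate getmax step is needed.

```r
# Sample data from the question
DF <- data.frame(
  Gene = paste0("Gene", 1:6),
  Condition = c(
    "Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker",
    "Name=blood pressure, Name=diabetes",
    "Name=heart disease",
    "Name=Childhood ear infection",
    NA,
    "Name=kidney disease"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: test patterns from highest to lowest value,
# so the first pattern that matches gives the row's maximum.
score_condition <- function(x) {
  ifelse(is.na(x), NA_real_,                                   # keep missing rows as NA
  ifelse(grepl("blood pressure", x, ignore.case = TRUE), 3,
  ifelse(grepl("heart", x, ignore.case = TRUE), 2,
  ifelse(grepl("diabetes|kidney disease", x, ignore.case = TRUE), 1,
         0.5))))                                               # any other condition
}

DF$Score <- score_condition(DF$Condition)
print(DF[, c("Gene", "Score")])
```

The == comparisons in the question only match cells that are exactly equal to the given string, whereas grepl tests whether the pattern occurs anywhere in the cell, which is what "contains" requires.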


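The matching operation can be expressed as a complex join in SQL. First create numDF, a two-column data frame in which each name in the first column is paired with its number in the second column. Then perform the join: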
library(sqldf)

nums <- c("blood pressure" = 3, heart = 2, diabetes = 1, "kidney disease" = 1)
numDF <- data.frame(Name = names(nums), Value = as.vector(nums))

sqldf("select 
    a.Gene, 
    max(case when a.Condition is not Null then coalesce(b.Value, 0.5) end) Condition
  from DF a 
  left join numDF b on a.Condition like '%' || b.Name || '%'
  group by Gene", method = "raw")
Note: dput cannot be used on an object with internal pointers, so I have modified the dput output to make it usable:

DF <-
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6"), Condition = c("    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker", 
"    Name=blood pressure, Name=diabetes", "Name=heart disease", 
"Name=Childhood ear infection", NA, "Name=kidney disease")), 
row.names = c(NA, -6L), class = "data.frame")

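There are some unresolved issues in your data. For example, row 2 contains both blood pressure and diabetes, which have different values. Which one should be chosen in that case?

The highest value should be chosen. Apologies, I should have made clear that I consider the highest value the most important, which is why I tried to collect all the numbers in one cell and then apply getmax to each cell/row.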
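Splitting the Condition column into one row per condition with data.table:

library(data.table)  # `data` is assumed to be a data.table with columns Gene and Condition
res <- data[, list(Condition = unlist(strsplit(Condition, ","))), by = Gene
            ][, Condition := gsub("Name=", "", Condition) ]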

res
# Gene                                         Condition
# 1: Gene1               Asymmetrical dimethylarginine level
# 2: Gene1                Bipolar disorder and schizophrenia
# 3: Gene1  3-hydroxypropylmercapturic acid levels in smoker
# 4: Gene2                                    blood pressure
# 5: Gene2                                          diabetes
# 6: Gene3                                     heart disease
# 7: Gene4                           Childhood ear infection
# 8: Gene5                                              <NA>
# 9: Gene6                                    kidney disease
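The split result above stops short of the numeric values the question asks for. As a hedged sketch of how it might be finished, assuming the data.table package (version 1.13.0 or later, for fcase) and the sample data from the question, each split condition can be scored and the maximum taken per gene. The Score column and the scores object are names introduced here for illustration only.

```r
library(data.table)

# Sample data from the question, as a data.table
data <- data.table(
  Gene = paste0("Gene", 1:6),
  Condition = c(
    "Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker",
    "Name=blood pressure, Name=diabetes",
    "Name=heart disease",
    "Name=Childhood ear infection",
    NA,
    "Name=kidney disease"
  )
)

# One row per condition; trimws() drops the space left after each comma
res <- data[, list(Condition = unlist(strsplit(Condition, ","))), by = Gene
            ][, Condition := trimws(gsub("Name=", "", Condition))]

# Score every single condition (Score is a hypothetical column name)
res[, Score := fcase(
  is.na(Condition), NA_real_,
  grepl("blood pressure", Condition), 3,
  grepl("heart", Condition), 2,
  grepl("diabetes|kidney disease", Condition), 1,
  default = 0.5
)]

# The "getmax" step: keep the largest score found for each gene
scores <- res[, list(Condition = max(Score)), by = Gene]
print(scores)
```

Scoring after the split keeps the rule "highest value wins" explicit in the final max(), at the cost of an extra intermediate table.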
The sqldf query from the SQL answer above returns:

   Gene Condition
1 Gene1       0.5
2 Gene2       3.0
3 Gene3       2.0
4 Gene4       0.5
5 Gene5        NA
6 Gene6       1.0