R: split a string into separate columns
I have data from a survey questionnaire. One of the questions is multiple choice and includes an "Other" option where respondents can write in their own answer. I received an Excel file with a single column for this question, in which the selected options are separated by semicolons. Here is a sample of the data set:
ID Prob_saude
1 "Não tenho nenhum dos problemas de saúde indicados;"
2 " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);"
3 " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);"
4 "Doença autoimmune;"
5 " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;"
6 "HIV;"
7 "Não tenho nenhum dos problemas de saúde indicados;"
8 "Cardiológica;"
I want to create one yes/no column per disease that shows whether the respondent selected that option. I then want another column that holds the "other" answer. In this case the available options are:
disease <- c(" Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);",
"Hipertensão arterial (tensão arterial alta);",
"Doença autoimmune;",
"Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);",
"Não tenho nenhum dos problemas de saúde indicados;")
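I can create the extra columns for the predefined options, but when I try to create the column for the other answers, the output is identical to the Prob_saude column, so the options that were already selected are not removed. Any ideas? This is what I have so far. If you think there is a better way to do this, feel free to suggest it.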
dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]
for (index in 1:length(disease)) {
rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := gsub(disease[index], "", dataset$Prob_saude, fixed = T)]
}
One way to handle this is to merge the items written in as "other" into the list of disease types. Based on the data, there are 5 disease types in the original disease vector and 3 new disease types reported in the questionnaire.
First, after some cleanup, we read the data posted with the question.
textFile <- "id|response
1|Não tenho nenhum dos problemas de saúde indicados;
2| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);
3| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);
4|Doença autoimmune;
5| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;
6|HIV;
7|Não tenho nenhum dos problemas de saúde indicados;
8|Cardiológica; "
data <- read.csv(text = textFile,sep = "|",
header = TRUE, stringsAsFactors = FALSE)
disease <- c("Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica)",
"Hipertensão arterial (tensão arterial alta)",
"Doença autoimmune",
"Problemas renais crónicos (doença nos rins, incluindo insuficiência renal)",
"Não tenho nenhum dos problemas de saúde indicados")
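Next the responses are split on the semicolon and reshaped to long format as narrowData (the separate() / pivot_longer() code appears in the code listing at the end of this page). At this point, the data contains 12 observations and 3 columns.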
> head(narrowData)
# A tibble: 6 x 3
# Groups: id [4]
id name disease
<int> <chr> <chr>
1 1 resp1 Não tenho nenhum dos problemas de saúde indicados
2 2 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
3 3 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
4 3 resp2 Hipertensão arterial (tensão arterial alta)
5 3 resp3 Problemas renais crónicos (doença nos rins, incluindo insuficiência…
6 4 resp1 Doença autoimmune
>
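For readers without tidyr, the same long-format reshaping can be sketched in base R. This is only an illustration under stated assumptions: the `survey` data frame and its short English labels are hypothetical stand-ins for the real questionnaire data, and the approach (`strsplit()` followed by `rep()`) is a swapped-in alternative, not the code used in the answer.

```r
# Hypothetical toy data in the same "a;b;c;" layout as the survey column
survey <- data.frame(
  id = 1:3,
  response = c("None;", " Asthma (chronic);Diabetes;", "HIV;"),
  stringsAsFactors = FALSE
)

# Split each response on a literal ";" and trim surrounding whitespace
parts <- lapply(strsplit(survey$response, ";", fixed = TRUE), trimws)

# Drop empty pieces (for example a lone space after the last semicolon)
parts <- lapply(parts, function(p) p[nzchar(p)])

# Build one row per (respondent, reported disease)
long <- data.frame(
  id = rep(survey$id, lengths(parts)),
  disease = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

Because `fixed = TRUE` treats the separator literally, parentheses and other regex metacharacters in the answers need no escaping.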
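In the diseaseData data frame, the diseases that were reported in the questionnaire but are not in the original list occupy positions 6, 7 and 8.
Because we created a unique sequence number for each disease name, we can now join the data and use the disease id to pivot it back into a wide-format data set keyed by respondent id.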
narrowData %>% left_join(.,diseaseData) -> joinedData
# create wide format data
joinedData %>% select(id,disease_id) %>% mutate(value = 2) %>%
pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
values_from = value) -> result
Finally, we set all NA values in the output to 1 and print the result.
result[is.na(result)] <- 1
result
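The 1 = not reported, 2 = reported coding can also be sketched in base R with a cross-tabulation instead of a pivot. The `long` data frame below is hypothetical toy data, and `table()` is an alternative technique chosen for illustration, not the method used in the answer.

```r
# Hypothetical long-format data: one row per (respondent, disease id)
long <- data.frame(id = c(1, 2, 2, 3), disease_id = c(3, 1, 2, 1))

# Cross-tabulate respondents against disease ids (0 = never reported)
counts <- table(long$id, long$disease_id)

# Recode to the convention used above: 1 = not reported, 2 = reported
wide <- ifelse(unclass(counts) > 0, 2, 1)
colnames(wide) <- paste0("disease", colnames(wide))
print(wide)
```

A repeated (id, disease_id) pair only raises the count in its cell, so this variant needs no separate de-duplication step. Cells are never NA, so there is also nothing to fill afterwards.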
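Once again, we have a tidy narrow-format data frame with one row per reported disease.
Next, we process the diseases to identify reported diseases that are not in the original list of choices, assign them a disease id one greater than the length of the disease vector, and create a data frame.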
# create disease data frame by combining data with unique values in survey data frame
narrowData %>% distinct(trimws(disease)) %>% .[[1]] -> reportedDiseases
notInDiseaseList <- unique(reportedDiseases[!reportedDiseases %in% disease ])
disease_id <- 1:length(disease)
diseaseData <- data.frame(disease_id,disease,stringsAsFactors = FALSE)
disease_id <- rep(max(diseaseData$disease_id)+1,length(notInDiseaseList))
reportedDiseases <- data.frame(disease_id,disease = notInDiseaseList,stringsAsFactors = FALSE)
diseaseData <- rbind(diseaseData,reportedDiseases)
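The idea of sending every unlisted answer to one shared id can be shown more compactly with `match()`. This is a minimal sketch with hypothetical option labels, not a replacement for the data frames built above.

```r
# Hypothetical predefined options and reported answers
disease <- c("Asthma", "Hypertension", "None")
reported <- c("Asthma", "HIV", "None", "Cardiac")

# match() returns the position in the predefined list, or NA when absent
disease_id <- match(reported, disease)

# Every unlisted answer gets the same id: one past the end of the list
disease_id[is.na(disease_id)] <- length(disease) + 1L
print(data.frame(reported, disease_id))
```

Here "HIV" and "Cardiac" both receive id 4, which mirrors how the answer codes all write-in responses as a single "other" category.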
Finally, before using pivot_wider() to create a data frame with six disease columns (coded 1 = not reported, 2 = reported, for the 5 listed disease types plus "other"), we eliminate duplicates where disease_id equals 6, which occur when a respondent reports more than one unlisted disease. The code for this step and its printed output form the last part of the code listing at the end of this page, starting at the comment "# create wide format data after eliminating".
gsub did not work because of the parentheses. Changing the string solves the problem.
The code is a little longer now:
dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]
for (index in 1:length(disease)) {
rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
}
# Alternatives are separated by "|"; parentheses are escaped because this is a regular expression
disease <- c(" Doença respiratória/pulmonar \\(incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica\\);|Hipertensão arterial \\(tensão arterial alta\\);|Doença autoimmune;|Problemas renais crónicos \\(doença nos rins, incluindo insuficiência renal\\);|Não tenho nenhum dos problemas de saúde indicados;")
dataset$other_disease <- gsub(disease, "", dataset$Prob_saude)
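An alternative that avoids regex escaping altogether is to split each response on the semicolon and keep only the pieces that are not predefined options. The sketch below is illustrative only: `options_known` and `prob` are hypothetical stand-ins for the real option list and the Prob_saude column, and the option labels are written without the trailing semicolon.

```r
# Hypothetical predefined options (no trailing semicolon)
options_known <- c("Asthma (incl. bronchitis)", "Hypertension (high BP)", "None")

# Hypothetical responses in the same "a;b;c;" layout as Prob_saude
prob <- c("None;", " Asthma (incl. bronchitis);HIV;", "Cardiac;", NA)

# Split on a literal ";", trim whitespace, drop empty pieces
pieces <- lapply(strsplit(prob, ";", fixed = TRUE), function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})

# Keep only what the respondent wrote in; missing responses stay NA
other <- vapply(pieces, function(p) {
  if (length(p) == 1 && is.na(p)) return(NA_character_)
  paste(setdiff(p, options_known), collapse = ";")
}, character(1))
print(other)
```

Because the comparison is an exact string match, parentheses in the option labels are harmless, and a respondent who selected only predefined options ends up with an empty string in the other column.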
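Thank you for your reply. I forgot to mention that this is a file with 200,000 entries and that the number of "other" answers is very large; I only gave three examples here. I am not sure whether this would be too slow to compute, and it would create a lot of extra columns that I do not need. My idea is to keep the written-in answers in a final column to be exported.

@psoares - At some point you will need to analyze what is in the "other" answers. If you want to recode anything outside the list as "other", the solution can easily be adjusted to produce that output. Later tonight I will update my answer to show how to recode the remaining answers as "other".

@psoares - I posted an updated solution that codes any reported answer not in the original disease vector as disease6.

Thank you for your reply. You are right. I may want to analyze the other content, so I need to keep that string in a column. If you look at the output and code I provided, I want to keep a final column with the disease string. The yes/no columns are handled with the grep function. My problem is putting only the other string in that column, without the rest of the strings. In this case it makes no sense to keep one column per "disease", because people can write nonsense. But as you said, someone will need to analyze the content, so they need an extra column that contains only what the user wrote. I thought gsub would work well for this, but it did not do what I expected.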
Printed result of the first version of the answer, with one column per distinct reported disease:
> result
# A tibble: 8 x 9
# Groups: id [8]
id disease1 disease2 disease3 disease4 disease5 disease6 disease7 disease8
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 1 1 1 1 1 1
2 2 1 2 1 1 1 1 1 1
3 3 1 2 2 2 1 1 1 1
4 4 1 1 1 1 2 1 1 1
5 5 1 2 2 1 1 2 1 1
6 6 1 1 1 1 1 1 2 1
7 7 2 1 1 1 1 1 1 1
8 8 1 1 1 1 1 1 1 2
Code listing for the updated answer, which uses data and disease as defined earlier:
library(tidyr)
library(dplyr)
library(glue)
data %>% separate(.,response,into = c("resp1","resp2","resp3","resp4","resp5"),
sep=";") %>% group_by(id) %>%
pivot_longer(.,c(resp1,resp2,resp3,resp4,resp5),values_to = "disease") %>%
mutate(disease = trimws(disease)) %>%
filter(!disease %in% c(NA," "," ","")) -> narrowData
# create disease data frame by combining data with unique values in survey data frame
narrowData %>% distinct(trimws(disease)) %>% .[[1]] -> reportedDiseases
notInDiseaseList <- unique(reportedDiseases[!reportedDiseases %in% disease ])
disease_id <- 1:length(disease)
diseaseData <- data.frame(disease_id,disease,stringsAsFactors = FALSE)
disease_id <- rep(max(diseaseData$disease_id)+1,length(notInDiseaseList))
reportedDiseases <- data.frame(disease_id,disease = notInDiseaseList,stringsAsFactors = FALSE)
diseaseData <- rbind(diseaseData,reportedDiseases)
narrowData %>% left_join(.,diseaseData) -> joinedData
# create wide format data after eliminating
# any duplicates where multiple reported diseases for a respondent
joinedData %>% select(id,disease_id) %>%
group_by(id,disease_id) %>%
mutate(value = 2, n = row_number()) %>%
filter(n == 1) %>%
pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
values_from = value) -> result
result[is.na(result)] <- 1
result
> result
# A tibble: 8 x 7
# Groups: id [8]
id disease5 disease1 disease2 disease4 disease3 disease6
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 1 1 1 1
2 2 1 2 1 1 1 1
3 3 1 2 2 2 1 1
4 4 1 1 1 1 2 1
5 5 1 2 2 1 1 2
6 6 1 1 1 1 1 2
7 7 2 1 1 1 1 1
8 8 1 1 1 1 1 2
>