R: Splitting a string into separate columns

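I have data from a questionnaire. One of the questions is multiple choice and includes an "other" option where users can write something else. I received an Excel file with one column for this question, in which the selected options are separated by semicolons. A sample of the data set: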

ID  Prob_saude
1   "Não tenho nenhum dos problemas de saúde indicados;" 
2   " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);"
3   " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);"
4   "Doença autoimmune;" 
5   " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;"  
6    "HIV;"
7    "Não tenho nenhum dos problemas de saúde indicados;" 
8    "Cardiológica;" 
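I want to create a yes/no column for each disease, indicating whether the user selected that option. I then want another column that holds the "other" answer. In this case the available options are: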

disease <- c(" Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);",
         "Hipertensão arterial (tensão arterial alta);", "Doença autoimmune;",
         "Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);",
         "Não tenho nenhum dos problemas de saúde indicados;") 
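I can create the extra columns for the listed options, but when I try to create the column for the other answers, the output is identical to the Prob_saude column, so the options that were already selected are not removed. Any ideas? This is what I have so far. If you think there is a better way to do this, please feel free to suggest it.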

dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]

for (index in 1:length(disease)) {
    rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
    dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
    dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := gsub(disease[index], "", dataset$Prob_saude, fixed = T)]
}

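One way to handle this situation is to merge the items reported as "other" into the list of disease types. In the sample data there are 5 disease types in the original disease vector and 3 new disease types in the questionnaire responses.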

First, after some cleanup, we read the data posted with the question.

textFile <- "id|response
1|Não tenho nenhum dos problemas de saúde indicados; 
2| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);
3| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);
4|Doença autoimmune; 
5| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;  
6|HIV;
7|Não tenho nenhum dos problemas de saúde indicados; 
8|Cardiológica; "

data <- read.csv(text = textFile,sep = "|",
                 header = TRUE, stringsAsFactors = FALSE)
disease <- c("Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica)",
             "Hipertensão arterial (tensão arterial alta)", 
             "Doença autoimmune",
             "Problemas renais crónicos (doença nos rins, incluindo insuficiência renal)",
             "Não tenho nenhum dos problemas de saúde indicados")
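After each response is split on ";" and reshaped to long format as narrowData (one row per reported disease), the data contains 12 observations and 3 columns.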

> head(narrowData)
# A tibble: 6 x 3
# Groups:   id [4]
     id name  disease                                                             
  <int> <chr> <chr>                                                               
1     1 resp1 Não tenho nenhum dos problemas de saúde indicados                   
2     2 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
3     3 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
4     3 resp2 Hipertensão arterial (tensão arterial alta)                         
5     3 resp3 Problemas renais crónicos (doença nos rins, incluindo insuficiência…
6     4 resp1 Doença autoimmune                                                   
> 
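For readers who want to see the reshaping idea without tidyr, here is a minimal base R sketch of the same step: split each response on ";", trim whitespace, drop empty pieces, and stack the results into one row per reported disease. The names responses, pieces, and long are illustrative and not part of the answer's code, and only three of the eight responses are used.

```r
# Illustrative subset of the survey responses (ids 3, 4 and 6 from the question)
responses <- c(
  "3" = " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);",
  "4" = "Doença autoimmune; ",
  "6" = "HIV;"
)

# Split on ";" literally, trim whitespace, and drop empty pieces
pieces <- lapply(strsplit(responses, ";", fixed = TRUE), function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})

# Stack into a long data frame: one row per (id, disease) pair
long <- data.frame(
  id      = rep(as.integer(names(pieces)), lengths(pieces)),
  disease = unlist(pieces, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

Because the split uses fixed = TRUE, the parentheses in the option text need no escaping.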
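In the diseaseData data frame, the diseases that were reported in the questionnaire but are not in the original list (Diabetes, HIV, and Cardiológica) occupy positions 6, 7, and 8.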

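Because we created a unique sequence number for each disease name, we can now join the data and use the disease id to pivot it back to a wide-format data set with one row per respondent id.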

narrowData %>% left_join(.,diseaseData) -> joinedData
# create wide format data 
joinedData %>% select(id,disease_id) %>% mutate(value = 2) %>%
     pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
                 values_from = value) -> result
Finally, we set all NA values in the output to 1 and print it.

result[is.na(result)] <- 1
result
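As a rough illustration of the 1/2 coding produced by the join-and-pivot step, the following base R sketch cross-tabulates long-format (id, disease_id) pairs and recodes the counts. The data frame long and its values are made up for the example; this is an alternative sketch, not the code used in the answer.

```r
# Hypothetical long-format pairs: respondent id and disease id
long <- data.frame(
  id         = c(1L, 2L, 3L, 3L, 3L, 5L, 5L),
  disease_id = c(5L, 1L, 1L, 2L, 4L, 6L, 6L)
)

# Cross-tabulate; the factor levels fix the set and order of the columns
counts <- table(
  id      = long$id,
  disease = factor(long$disease_id, levels = 1:6)
)

# Recode: 2 = reported at least once, 1 = not reported
wide <- ifelse(unclass(counts) > 0, 2L, 1L)
colnames(wide) <- paste0("disease", colnames(wide))
print(wide)
```

Respondent 5 reports disease id 6 twice in this toy input, and the cross-tabulation collapses the repeats into a single 2, so no separate de-duplication step is needed in this variant.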
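For the version that recodes every write-in answer as a single "other" category, we again start from a tidy narrow-format data frame with one row per reported disease.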

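Next, we process the diseases to identify reported diseases that are not in the original choice list, assign them a disease id one greater than the length of the disease vector, and create a data frame.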

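# create disease data frame by combining the disease vector with the unique values in the survey data frame
# narrowData is grouped by id and distinct() keeps grouping columns, so read the disease column directly
reportedDiseases <- unique(trimws(narrowData$disease))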
notInDiseaseList <- unique(reportedDiseases[!reportedDiseases %in% disease ])
disease_id <- 1:length(disease)
diseaseData <- data.frame(disease_id,disease,stringsAsFactors = FALSE)
disease_id <- rep(max(diseaseData$disease_id)+1,length(notInDiseaseList))
reportedDiseases <- data.frame(disease_id,disease = notInDiseaseList,stringsAsFactors = FALSE)
diseaseData <- rbind(diseaseData,reportedDiseases)
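The id assignment described above can also be written with match(), which may be easier to follow. This is a minimal sketch with shortened, hypothetical English labels standing in for the Portuguese option text; reported and disease_id are illustrative names.

```r
# Known options, abbreviated here for readability (hypothetical short labels)
disease  <- c("Respiratory", "Hypertension", "Autoimmune", "Renal", "None")
reported <- c("None", "Respiratory", "Diabetes", "HIV", "Hypertension")

# match() returns the position in `disease`, or NA when the answer is not listed
disease_id <- match(reported, disease)

# Every unlisted answer shares one "other" id: length(disease) + 1
disease_id[is.na(disease_id)] <- length(disease) + 1L
print(data.frame(reported, disease_id))
```

Here both Diabetes and HIV receive id 6, which is why a respondent with several write-in answers can end up with duplicate (id, disease_id) pairs.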
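Finally, we eliminate duplicates where disease_id equals 6 before using pivot_wider() to create a data frame with 6 indicator columns (1 = no disease, 2 = disease): one for each of the 5 listed disease types plus one for "other".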

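# create wide format data after eliminating
# any duplicates where a respondent reported multiple "other" diseases
joinedData %>% select(id,disease_id) %>% 
     group_by(id,disease_id) %>%
     mutate(value = 2, n = row_number()) %>%
     filter(n == 1) %>% 
     pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
                 values_from = value) -> result
result[is.na(result)] <- 1
result
> result
# A tibble: 8 x 7
# Groups:   id [8]
     id disease5 disease1 disease2 disease4 disease3 disease6
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1     1        2        1        1        1        1        1
2     2        1        2        1        1        1        1
3     3        1        2        2        2        1        1
4     4        1        1        1        1        2        1
5     5        1        2        2        1        1        2
6     6        1        1        1        1        1        2
7     7        2        1        1        1        1        1
8     8        1        1        1        1        1        2
> 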

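gsub was not working because of the parentheses. Changing the pattern string so that they are escaped solves this.

The code is now a bit longer.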

dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]
for (index in 1:length(disease)) {
    rows <- grep(disease[index], dataset$Prob_saude, fixed = TRUE)
    dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
}

# one regular expression with the known options as alternatives; parentheses are escaped
disease <- c(" Doença respiratória/pulmonar \\(incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica\\);|Hipertensão arterial \\(tensão arterial alta\\);|Doença autoimmune;|Problemas renais crónicos \\(doença nos rins, incluindo insuficiência renal\\);|Não tenho nenhum dos problemas de saúde indicados;") 
# whatever remains after deleting the known options is the free-text answer
dataset$other_disease <- gsub(disease, "", dataset$Prob_saude)
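One way to avoid regex escaping altogether is to split each response on ";" and keep only the pieces that are not in the known list. This is sketched below under the assumption that every answer is terminated by ";". The function extract_other and the sample responses are illustrative, and known holds a subset of the question's options without the trailing ";".

```r
# A subset of the known options from the question, without the trailing ";"
known <- c("Hipertensão arterial (tensão arterial alta)",
           "Doença autoimmune",
           "Não tenho nenhum dos problemas de saúde indicados")

# Return the free-text part of each response, or NA when nothing is left
extract_other <- function(x, known) {
  pieces <- strsplit(x, ";", fixed = TRUE)
  vapply(pieces, function(p) {
    p <- trimws(p)
    other <- p[!is.na(p) & nzchar(p) & !(p %in% known)]
    if (length(other) == 0L) NA_character_ else paste(other, collapse = "; ")
  }, character(1))
}

responses <- c("Hipertensão arterial (tensão arterial alta);Diabetes;",
               "Doença autoimmune; ",
               "HIV;")
other_disease <- extract_other(responses, known)
print(other_disease)
```

In the data.table from the question this could be applied as dataset[, other_disease := extract_other(Prob_saude, known)], assuming known lists every predefined option.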

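Thanks for the reply. I forgot to mention that this is a file with 200,000 entries and that the number of "other" answers is very large; I only gave three examples here. I am not sure whether this would be too slow to compute, and it would create a lot of extra columns that I do not need. My idea is to keep the written-in answer in a final column for export.

@psoares - at some point you will need to analyze what is in the "other" answers. If you want to recode anything outside the list as "other", the solution can easily be adjusted to provide that output. Later tonight I will update my answer to show how to recode the other answers as "other".

@psoares - I posted an updated solution that codes any reported answer not in the original disease vector as disease6.

Thanks for the reply. You are right. I may want to analyze the other content, so I need to keep that string in a column. If you look at the output and code provided, I want to keep the last column with the disease string. The yes/no columns work with the grep function. My problem is putting only the other string in that column, without the rest. In this case it makes no sense to keep a column for each "disease", because people can write nonsense. But as you say, someone will need to analyze the content, so they need the extra column containing only what the user wrote. I thought gsub would be good for this, but it did not do what I expected.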
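Output of the first version, which keeps one column per reported disease (disease6, disease7 and disease8 are the write-in answers):

> result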
# A tibble: 8 x 9
# Groups:   id [8]
     id disease1 disease2 disease3 disease4 disease5 disease6 disease7 disease8
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1     1        2        1        1        1        1        1        1        1
2     2        1        2        1        1        1        1        1        1
3     3        1        2        2        2        1        1        1        1
4     4        1        1        1        1        2        1        1        1
5     5        1        2        2        1        1        2        1        1
6     6        1        1        1        1        1        1        2        1
7     7        2        1        1        1        1        1        1        1
8     8        1        1        1        1        1        1        1        2
> 
# split each response on ";" into up to five columns, reshape to long format
# (one row per reported disease), trim whitespace, and drop empty pieces
library(tidyr)
library(dplyr)
library(glue)
data %>% separate(.,response,into = c("resp1","resp2","resp3","resp4","resp5"),
                  sep=";")  %>% group_by(id) %>%
     pivot_longer(.,c(resp1,resp2,resp3,resp4,resp5),values_to = "disease") %>%
     mutate(disease = trimws(disease)) %>%
     filter(!disease %in% c(NA," ","  ",""))    -> narrowData