R: the regex works, but the code looks terrible

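I am cleaning up a long list of noun phrases for further text mining. They are supposed to be one- or two-word phrases, but some contain a / acting as a conjunction. Here is what I have: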

library(tidyverse)
conjuncts <- tibble(usecase = 1:3,
                   classes = c("Insulators/Insulation",
                               "Optic/light fiber",
                               "Magnets"))
What I want:

wanted <- tibble(usecase = c(1,1,2,2,3),
                 classes =  c("Insulators/Insulation",
                              "Insulators/Insulation",
                              "Optic/light fiber",
                              "Optic/light fiber",
                              "Magnets"),
                 bigrams = c("Insulators", "Insulation",
                             "Optic fiber", "Light fiber", NA))
I have something that works, but it is ugly and does not scale:

patternSplit <- function(class){
  regexs <- c("(?x) ^ (\\w+) / (\\w+) $",
              "(?x) ^ (\\w+) / (\\w+) \\s+ (\\w+) $")
  if(str_detect(class, regexs[1])){
    extr <- str_match(class, regexs[1])
    list(extr[1,2],
         extr[1,3]) 
  } else if(str_detect(class, regexs[2])){
    extr <- str_match(class, regexs[2])
    list(paste(extr[1,2], extr[1,4]), 
         paste(extr[1,3], extr[1,4])) 
  } else {
    list(NA_character_)
  }
}

anx <- conjuncts %>% 
  mutate(bigrams = map(classes, patternSplit)) %>% 
  unnest(cols = "bigrams") %>% 
  unnest(cols = "bigrams")
This gives me what I want, but good grief.

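Two main problems: (1) I have to run each regex twice, once with str_detect to drive the if/else logic and again with str_match to extract the tokens. (2) I had to fight the list structure to unnest it, which takes two unnest calls. A minor question: (3) can I get away from if/else and use case_when or switch instead?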

I will eventually extend this to a dozen or so patterns and use cases.
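One way to address the three concerns above is to keep the patterns in a table and match each one only once. The sketch below is an illustration, not the asker's code: `rules` and `apply_rules()` are hypothetical names, and it relies on `str_match()` returning `NA` for a non-match, so no separate `str_detect()` call is needed.

```r
library(stringr)

# Hypothetical rule table: each entry pairs a regex with a function that
# builds the output phrases from the match (m[1] is the full match,
# m[2], m[3], ... are the capture groups).
rules <- list(
  list(regex = "^(\\w+)/(\\w+)$",
       build = function(m) c(m[2], m[3])),
  list(regex = "^(\\w+)/(\\w+)\\s+(\\w+)$",
       build = function(m) paste(c(m[2], m[3]), m[4]))
)

# Try each rule in order; the first regex that matches decides the result.
apply_rules <- function(class, rules) {
  for (rule in rules) {
    m <- str_match(class, rule$regex)[1, ]
    if (!is.na(m[1])) {
      return(rule$build(m))
    }
  }
  NA_character_  # no rule matched
}
```

Because `apply_rules()` returns a plain character vector rather than a nested list, something like `conjuncts %>% mutate(bigrams = map(classes, apply_rules, rules = rules)) %>% unnest(bigrams)` should need only a single `unnest()`. Adding a pattern means adding one entry to `rules` instead of another `else if` branch.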

Here is a solution that uses / as the separator to detect the conjoined words, then uses ifelse to build the desired result:

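patternSplit <- function(x, p = "[A-Za-z]+/[A-Za-z]+") {
  # x1: the "word/word" part, or NA when the phrase contains no "/"
  x1 <- stringr::str_extract(x, p)
  # x2: whatever is left after removing that part (e.g. " fiber")
  x2 <- stringr::str_replace(x, p, "")
  cbind(val1 = x1, val2 = x2)
}

conjuncts <- cbind(conjuncts, conjuncts$classes %>% patternSplit()) %>%
  # one row per word on either side of the "/"
  tidyr::separate_rows(val1, sep = "/") %>%
  # re-attach the shared remainder; phrases without "/" stay NA
  dplyr::mutate(bigrams = ifelse(!is.na(val1), paste0(val1, val2), val1)) %>%
  dplyr::select(-contains("val"))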

conjuncts
  usecase               classes     bigrams
1       1 Insulators/Insulation  Insulators
2       1 Insulators/Insulation  Insulation
3       2     Optic/light fiber Optic fiber
4       2     Optic/light fiber light fiber
5       3               Magnets        <NA>

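Do you need something different from conjuncts %>% separate_rows(classes, sep = '/')?

Yes, depending on the pattern. Your code produces "Optic" where I need "Optic fiber". There are also more patterns to add. For example, "Integrated circuit/microcircuit" should become "Integrated circuit" and "Integrated microcircuit". Every case starts by splitting on the /, but is then reassembled according to the number of tokens on either side of the /.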
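The rule described in that last comment (split on the /, then reassemble by token count) can be sketched roughly as follows. This is one possible reading of the rule, not code from the thread: `split_conjunct()` is a hypothetical helper, it assumes exactly one / per phrase and single spaces between tokens, and it assumes the shorter side borrows the missing tokens from the longer side.

```r
library(stringr)

# Hypothetical helper: split a phrase on "/" and pad the shorter side with
# the tokens it is assumed to share with the longer side.
split_conjunct <- function(class) {
  if (!str_detect(class, fixed("/"))) {
    return(NA_character_)  # nothing to split
  }
  sides <- str_split(class, fixed("/"))[[1]]  # assumes exactly one "/"
  left  <- str_split(str_squish(sides[1]), " ")[[1]]
  right <- str_split(str_squish(sides[2]), " ")[[1]]
  n_left  <- length(left)
  n_right <- length(right)
  if (n_left < n_right) {
    # "Optic/light fiber": the left side borrows the trailing "fiber"
    left <- c(left, tail(right, n_right - n_left))
  } else if (n_right < n_left) {
    # "Integrated circuit/microcircuit": the right side borrows "Integrated"
    right <- c(head(left, n_left - n_right), right)
  }
  c(paste(left, collapse = " "), paste(right, collapse = " "))
}
```

Under these assumptions a single function covers both examples from the comments without a separate regex per shape, though phrases with several slashes or irregular spacing would need extra handling.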