R regex is working, but the code looks terrible
I am cleaning up a long list of noun phrases for further text mining. They should be one- or two-word phrases, but some contain a / joining two conjuncts. Here is what I have:
library(tidyverse)
conjuncts <- tibble(usecase = 1:3,
                    classes = c("Insulators/Insulation",
                                "Optic/light fiber",
                                "Magnets"))
I want:
wanted <- tibble(usecase = c(1,1,2,2,3),
                 classes = c("Insulators/Insulation",
                             "Insulators/Insulation",
                             "Optic/light fiber",
                             "Optic/light fiber",
                             "Magnets"),
                 bigrams = c("Insulators", "Insulation",
                             "Optic fiber", "Light fiber", NA))
I have something that works, but it is horrible and does not scale:
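patternSplit <- function(class){
  # (?x) enables free-spacing mode, so the spaces inside the patterns are ignored
  regexs <- c("(?x) ^ (\\w+) / (\\w+) $",               # word/word
              "(?x) ^ (\\w+) / (\\w+) \\s+ (\\w+) $")   # word/word word
  if(str_detect(class, regexs[1])){
    # word/word: return the two words as they are
    extr <- str_match(class, regexs[1])
    list(extr[1,2],
         extr[1,3])
  } else if(str_detect(class, regexs[2])){
    # word/word word: attach the shared last word to each alternative
    extr <- str_match(class, regexs[2])
    list(paste(extr[1,2], extr[1,4]),
         paste(extr[1,3], extr[1,4]))
  } else {
    # no slash pattern matched
    list(NA_character_)
  }
}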
anx <- conjuncts %>%
  mutate(bigrams = map(classes, patternSplit)) %>%
  unnest(cols = "bigrams") %>%
  unnest(cols = "bigrams")
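This gives me what I want, but good grief.
The two main problems: (1) I have to run the regex twice, once with str_detect to drive the if/else logic and again with str_match to extract the tokens. (2) I had to go to great lengths to unpack the nested list structure. A minor problem: (3) can I get away from if/else and use case_when or switch instead?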
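I will eventually extend this to a dozen or so patterns and use cases.
Here is a solution that uses / as the delimiter to detect the slash-joined words, then uses ifelse to assemble the desired result: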
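# [A-Za-z] is used instead of [A-z], which also matches [ \ ] ^ _ and `
patternSplit <- function(x, p = "[A-Za-z]+[/][A-Za-z]+"){
  x1 <- stringr::str_extract(x, p)       # the "word/word" part, NA if absent
  x2 <- stringr::str_replace(x, p, "")   # whatever follows it, e.g. " fiber"
  return(cbind(val1 = x1, val2 = x2))
}
conjuncts <- cbind(conjuncts, conjuncts$classes %>% patternSplit()) %>%
  tidyr::separate_rows(val1, sep = '/') %>%   # one row per alternative
  dplyr::mutate(bigrams = ifelse(!is.na(val1), paste0(val1, val2), val1)) %>%
  dplyr::select(-contains("val"))
conjuncts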
  usecase               classes     bigrams
1       1 Insulators/Insulation  Insulators
2       1 Insulators/Insulation  Insulation
3       2     Optic/light fiber Optic fiber
4       2     Optic/light fiber light fiber
5       3               Magnets        <NA>
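On the points about running each regex twice and escaping the if/else chain, one possible approach is to keep the patterns in a table and let str_match do both jobs, since it returns NA in every column when a pattern does not match. The following is only a sketch under that assumption: the names rules and split_by_rules are made up for illustration, and only the two patterns from the question are covered.

```r
library(tidyverse)

# Hypothetical rule table: each entry pairs a regex with a function that
# rebuilds the phrases from the match vector m (m[1] is the full match,
# m[2], m[3], ... are the capture groups).
rules <- list(
  list(regex = "^(\\w+)/(\\w+)$",
       build = function(m) c(m[2], m[3])),
  list(regex = "^(\\w+)/(\\w+)\\s+(\\w+)$",
       build = function(m) paste(c(m[2], m[3]), m[4]))
)

# Try the rules in order. str_match() runs once per rule; a non-match shows
# up as NA in the full-match column, so no separate str_detect() is needed.
split_by_rules <- function(class, rules) {
  for (rule in rules) {
    m <- unname(str_match(class, rule$regex)[1, ])
    if (!is.na(m[1])) return(rule$build(m))
  }
  NA_character_
}

conjuncts <- tibble(usecase = 1:3,
                    classes = c("Insulators/Insulation",
                                "Optic/light fiber",
                                "Magnets"))

# Each call returns a plain character vector, so one unnest() is enough.
result <- conjuncts %>%
  mutate(bigrams = map(classes, split_by_rules, rules = rules)) %>%
  unnest(cols = bigrams)
```

With this layout, adding a pattern means appending one entry to rules rather than adding another else-if branch.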
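Do you need something different from conjuncts %>% separate_rows(classes, sep = '/')?
Yes, depending on the pattern. Your code produces "Optic" where I need "Optic fiber". There are also more patterns to add. For example, "Integrated circuit/microcircuit" should become "Integrated circuit" and "Integrated microcircuit". All cases start by splitting at the /, but are then reassembled according to the number of tokens on either side of it.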
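The split-then-reassemble idea described in that comment could be sketched as follows. The borrowing rule is an assumption inferred from the two examples given (the shorter side takes the tokens it lacks from the longer side), not something the asker confirmed, and the function name reassemble is hypothetical. The sketch uses base R only and assumes at most one / per phrase.

```r
# Assumed rule: split at "/", tokenise both sides, then let the shorter side
# borrow the missing tokens from the longer one. A short left side borrows
# trailing tokens; a short right side borrows leading tokens.
reassemble <- function(class) {
  if (!grepl("/", class, fixed = TRUE)) return(NA_character_)
  sides <- strsplit(class, "/", fixed = TRUE)[[1]]
  left  <- strsplit(trimws(sides[1]), "\\s+")[[1]]
  right <- strsplit(trimws(sides[2]), "\\s+")[[1]]
  n_left  <- length(left)
  n_right <- length(right)
  if (n_left < n_right) {
    # e.g. "Optic/light fiber": append the tail of the right side to the left
    left <- c(left, tail(right, n_right - n_left))
  } else if (n_right < n_left) {
    # e.g. "Integrated circuit/microcircuit": prepend the head of the left side
    right <- c(head(left, n_left - n_right), right)
  }
  c(paste(left, collapse = " "), paste(right, collapse = " "))
}

reassemble("Optic/light fiber")                # "Optic fiber" "light fiber"
reassemble("Integrated circuit/microcircuit")  # "Integrated circuit" "Integrated microcircuit"
reassemble("Magnets")                          # NA
```

Because the rule depends only on token counts, it would not need a separate regex per pattern, though phrases that do not follow the borrowing assumption would still need their own handling.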