Creating a column based on another column using grepl

Let's consider a data frame df with two columns, WORD and STEM. I want to create a new column that checks whether the value in STEM is contained in WORD, and whether it is preceded or followed by more characters. The final result should look like this:

WORD    STEM  NEW
rerun   run   prefixed
runner  run   suffixed
run     run   none
...     ...   ...

Below you can see my code. However, it does not work, because the grepl expression is applied to all rows of df. Anyway, I think it should illustrate my idea:
df$new <- ifelse(grepl(paste0('.+', df$stem, '.+'), df$word), 'both',
ifelse(grepl(paste0(df$stem, '.+'), df$word), 'suffixed',
ifelse(grepl(paste0('.+', df$stem), df$word), 'prefixed','none')))
You can use mapply to apply grepl row by row, for example:
ifelse(mapply(grepl, paste0(".+", x$STEM, ".+"), x$WORD), "both",
ifelse(mapply(grepl, paste0(x$STEM, ".+"), x$WORD), "suffixed",
ifelse(mapply(grepl, paste0(".+", x$STEM), x$WORD), "prefixed", "none")))
#"prefixed" "suffixed" "none"
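To see the row-wise pairing in isolation, here is a minimal self-contained sketch. The data frame x is hypothetical sample data modelled on the table in the question, and has_match is an illustrative helper name, not something from the original answer. The point is that grepl uses only the first element of a pattern vector, whereas mapply hands each pattern its own word.

```r
# Hypothetical sample data mirroring the table in the question
x <- data.frame(WORD = c("rerun", "runner", "run"),
                STEM = c("run", "run", "run"),
                stringsAsFactors = FALSE)

# grepl() is vectorised over the text, not over the pattern, so each
# pattern is paired with its own word via mapply().
# USE.NAMES = FALSE keeps the result free of pattern names.
has_match <- function(pattern, text) {
  mapply(grepl, pattern, text, USE.NAMES = FALSE)
}

x$NEW <- ifelse(has_match(paste0(".+", x$STEM, ".+"), x$WORD), "both",
         ifelse(has_match(paste0(x$STEM, ".+"), x$WORD), "suffixed",
         ifelse(has_match(paste0(".+", x$STEM), x$WORD), "prefixed", "none")))
x$NEW
```

Note that the stems are used as regular expressions here, so a stem containing characters such as . or ( would need escaping.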
Alternatively, use startsWith and endsWith and index into a vector of labels:
# Index 1 = none, 2 = both, 3 = stem at the start (suffixed), 4 = stem at the end (prefixed)
c("none", "both", "suffixed", "prefixed")[1 + (1 + startsWith(x$WORD, x$STEM) +
  2*endsWith(x$WORD, x$STEM)) * (nchar(x$WORD) > nchar(x$STEM) &
  mapply(grepl, x$STEM, x$WORD))]
#[1] "prefixed" "suffixed" "none"
You can create the new column like this:
df$new <- ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
ifelse(startsWith(df$word, df$stem), 'suffixed',
ifelse(endsWith(df$word, df$stem), 'prefixed',
'both')))
Output
#     word stem      new
# 1  rerun  run prefixed
# 2 runner  run suffixed
# 3    run  run     none
# 4  aruna  run     both
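One thing the nested ifelse above does not check is whether the stem occurs in the word at all: a word that neither starts nor ends with the stem falls through to 'both'. The following is a hedged sketch of one way to add that check. The function name classify_affix and the extra row "walk" are assumptions made for illustration only.

```r
# Hypothetical sample data; "walk" does not contain its stem at all
df <- data.frame(word = c("rerun", "runner", "run", "aruna", "walk"),
                 stem = c("run", "run", "run", "run", "run"),
                 stringsAsFactors = FALSE)

classify_affix <- function(word, stem) {
  starts <- startsWith(word, stem)
  ends <- endsWith(word, stem)
  # Literal (non-regex) containment test, one stem per word
  contains <- mapply(grepl, stem, word,
                     MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
  ifelse(word == stem | !contains, "none",
  ifelse(starts, "suffixed",
  ifelse(ends, "prefixed", "both")))
}

df$new <- classify_affix(df$word, df$stem)
df
```

In this sketch a word that both starts and ends with the stem (for example "runrun") is labelled "suffixed", which may or may not be what you want.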
Here is an approach using str_locate from stringr together with dplyr:
library(dplyr)
library(stringr)
data %>%
  mutate_at(vars(WORD, STEM), as.character) %>%
  mutate(NEW =
    case_when(str_locate(WORD, STEM)[, "start"] > 1 &
                str_locate(WORD, STEM)[, "end"] < nchar(WORD) ~ "both",
              str_locate(WORD, STEM)[, "start"] > 1 ~ "prefixed",
              str_locate(WORD, STEM)[, "end"] < nchar(WORD) ~ "suffixed",
              TRUE ~ "none"))
WORD STEM NEW
1 rerun run prefixed
2 runner run suffixed
3 run run none
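The position logic above can also be sketched in base R, which may help to see what the start and end comparisons are doing. This is not the answer's method: regexpr with fixed = TRUE is swapped in for str_locate (so stems are matched literally rather than as regular expressions), and first_pos, start, end, and the extra sample rows are hypothetical names and data chosen for illustration.

```r
# Hypothetical sample data, including a word without the stem
data <- data.frame(WORD = c("rerun", "runner", "run", "aruna", "walk"),
                   STEM = c("run", "run", "run", "run", "run"),
                   stringsAsFactors = FALSE)

# Position of the first literal occurrence of stem in word, or -1 if absent
first_pos <- function(stem, word) as.integer(regexpr(stem, word, fixed = TRUE))

start <- mapply(first_pos, data$STEM, data$WORD, USE.NAMES = FALSE)
end <- start + nchar(data$STEM) - 1L

data$NEW <- ifelse(start < 1, "none",                        # stem not found
            ifelse(start > 1 & end < nchar(data$WORD), "both",
            ifelse(start > 1, "prefixed",                    # characters before the stem
            ifelse(end < nchar(data$WORD), "suffixed",       # characters after the stem
                   "none"))))
data
```

Computing the positions once and reusing them avoids repeating the search in every condition.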
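I added a line of code that converts WORD and STEM to character, in case they are factors.

Thank you for the quick reply. I chose this answer as the solution because it is the most similar to my approach. In any case, Ian Campbell also solved the problem.

@hyhno01 Just to let you know, I updated my answer: I removed the nchar comparison between WORD and STEM, because I realised it is redundant.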