Regex: splitting the words in an R data frame column


I have a data frame in which one column holds words separated by single spaces. I want to split that column into three pieces. The data frame looks like this:

Text
one of the
i want to

I want to split it as follows:

Text         split1     split2    split3
one of the    one       one of     of the

I was able to get the first one, but I can't work out the other two.

My code for split1:

new_data$split1<-sub(" .*","",new_data$Text)

There may be more elegant solutions. Here are two options.

Using ngrams:

library(dplyr); library(tm)
df %>% mutate(splits = strsplit(Text, "\\s+")) %>% 
       mutate(split1 = lapply(splits, `[`, 1)) %>% 
       mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]), 
              split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>% 
       select(-splits)

        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to

Extracting the bigrams manually:

df %>% mutate(splits = strsplit(Text, "\\s+")) %>% 
       mutate(split1 = lapply(splits, `[`, 1)) %>% 
       mutate(split2 = lapply(splits, `[`, 1:2), 
              split3 = lapply(splits, `[`, 2:3)) %>% 
       select(-splits)

        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to
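Both options above produce list columns, which is why the output prints "one, of" rather than "one of". If plain character columns are wanted, a minimal base R sketch could look like the following. The helper name join_words and the example data are illustrative assumptions, and the sketch assumes every row holds at least three words (shorter rows would yield "NA" pieces).

```r
# Hypothetical example data; stringsAsFactors = FALSE keeps Text as character.
df <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Split each row on runs of whitespace into a character vector of words.
words <- strsplit(df$Text, "\\s+")

# Illustrative helper: join the words at positions `idx` with a single space.
join_words <- function(w, idx) paste(w[idx], collapse = " ")

# vapply() guarantees one string per row, so each result is a character column.
df$split1 <- vapply(words, join_words, character(1), idx = 1)
df$split2 <- vapply(words, join_words, character(1), idx = 1:2)
df$split3 <- vapply(words, join_words, character(1), idx = 2:3)

print(df)
```

Because the pieces are rebuilt with paste(), any run of whitespace in the input collapses to a single space in the output.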

Update:

With regular expressions, we can use back-references in gsub.

split2:

gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"

split3:

gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the"  "want to"
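The patterns above rely on greedy .* and backtracking to land on the right word boundaries. A slightly stricter variant, sketched here under the assumption that every value holds exactly three whitespace-separated words, anchors the pattern and matches words with \\S+. The vector x is hypothetical example data.

```r
# Hypothetical example input: exactly three words per element.
x <- c("one of the", "i want to")

# split2: capture the first two words, then require one more word to the end.
split2 <- sub("^(\\S+\\s+\\S+)\\s+\\S+$", "\\1", x)

# split3: skip the first word, then capture the remaining two.
split3 <- sub("^\\S+\\s+(\\S+\\s+\\S+)$", "\\1", x)

print(split2)
print(split3)
```

With the anchors in place, a value that does not consist of exactly three words fails to match and is returned unchanged by sub(), which makes unexpected rows easier to spot.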

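We can try using gsub. Capture each run of one or more non-whitespace characters (\\S+) as a group (there are 3 words in this case). Then, in the replacement, rearrange the back-references and insert a delimiter (,) so that read.table can convert the result into separate columns.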

 df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)", 
                  "\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to
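One caveat with the read.table route is that a comma inside a word would be read as an extra field separator. A possible alternative that avoids the intermediate delimiter is to pull the capture groups out directly with regexec() and regmatches(). This is only a sketch: it assumes every row matches the three-word pattern, and df1 here is hypothetical example data.

```r
# Hypothetical example data; stringsAsFactors = FALSE keeps Text as character.
df1 <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# For each row, regmatches() returns c(full match, word1, word2, word3).
m <- regmatches(df1$Text, regexec("^(\\S+)\\s+(\\S+)\\s+(\\S+)$", df1$Text))

# Stack the per-row vectors into a character matrix with one row per input.
# Assumes every row matched; a non-matching row would be silently dropped here.
parts <- do.call(rbind, m)

df1$split1 <- parts[, 2]
df1$split2 <- paste(parts[, 2], parts[, 3])
df1$split3 <- paste(parts[, 3], parts[, 4])

print(df1)
```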

This is a somewhat crude solution.
Assumption: you don't care about the number of spaces between two words.
> library(stringr)
> x<-c('one of the','i want to')
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2   \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one"    "one of" "of the"

#[[2]]
#[1] "i"       "i want"  "want to"

See ?cSplit from the splitstackshape package.
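Are there always three words?
Yes, always three.
I can get split2... though split3 still needs some help with the regex :-)
This is almost the same as my answer.
@akrun Hmm, it does look that way.. I type slowly.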