How to subset text by two sentences in R?


I have the following data frame:

df <- data.frame(
  Text = c(
    "This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.",
    "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.",
    "SO is great. You can get many things solve. Additional paragraph."
  ),
  stringsAsFactors = FALSE
)

I currently split the text into sentences like this:

library(textshape)

split_sentence(df$Text)

However, I would like to subset the "Text" column to the first two sentences of each row, so that I get a list like the following:

This is great.
A really great place to be.
Good morning.
There are very skilled programmers here. 
SO is great.
You can get many things solve.

Can anyone help me?

Thanks

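You can split Text into separate rows, one per sentence, and keep only the first two sentences from each original row. Using dplyr, you can do the following: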

library(dplyr)

df %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(Text, sep = '\\.\\s*') %>%
  group_by(row) %>%
  slice(1:2) %>%
  ungroup %>%
  select(-row)

#  Text                                   
#  <chr>                                  
#1 This is great                          
#2 A really great place to be             
#3 Good morning                           
#4 There are very skilled programmers here
#5 SO is great                            
#6 You can get many things solve        

Another option, using strsplit with head:

unlist(lapply(strsplit(df$Text, '(?<=\\.)\\s*', perl = TRUE), head, 2))
# [1] "This is great."                           "A really great place to be."             
# [3] "Good morning."                            "There are very skilled programmers here."
# [5] "SO is great."                             "You can get many things solve."    
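
The strsplit/head approach above can be wrapped in a small helper so that the number of sentences kept per row is a parameter. This is only a sketch: the names first_n_sentences, kept and result are hypothetical, and it assumes every sentence ends with a period followed by whitespace (abbreviations such as "e.g." would be split incorrectly).

```r
# Hypothetical helper (not part of any package): keep the first n
# sentences of each element of a character vector.
first_n_sentences <- function(text, n = 2) {
  # Split after each period, consuming the whitespace that follows it.
  parts <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)
  lapply(parts, head, n)
}

df <- data.frame(
  Text = c(
    "This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.",
    "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.",
    "SO is great. You can get many things solve. Additional paragraph."
  ),
  stringsAsFactors = FALSE
)

kept <- first_n_sentences(df$Text, n = 2)

# Record which original row each kept sentence came from.
result <- data.frame(
  row = rep(seq_along(kept), lengths(kept)),
  Text = unlist(kept),
  stringsAsFactors = FALSE
)
result
```

Keeping the row index alongside each sentence makes it possible to join the result back to the original data frame if needed.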

A base R solution. Note that it lets you set n to any integer and then follows that integer in a keep-n/skip-n pattern:

# Number of sentences to keep before removing the same number of sentences: n => integer scalar 
n <- 2

# Split the string into separate sentences: sentences => list of a character vector
res <- subset(data.frame(sentences = unlist(strsplit(paste0(df$Text, collapse = " "), "(?<=\\.)\\s+", perl = TRUE))),
                    ceiling(seq_along(sentences) / n) %% 2 == 1)[ , 1, drop = TRUE]

# Print the result to console: character vector => stdout (console)
res

# Data: 
df <- data.frame(
  Text = c(
    "This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.",
    "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.",
    "SO is great. You can get many things solve. Additional paragraph."
  ),
  stringsAsFactors = FALSE
)
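
One thing worth noting about the keep/skip solution above: it pastes all rows into a single string first, so the pattern runs across row boundaries. For the sample data this matches "first two sentences per row" because the first two rows each contain four sentences. The sketch below uses made-up input (two rows of three sentences) to show where the two readings differ, and a per-row variant. The names keep_skip, text, collapsed and per_row are hypothetical.

```r
n <- 2

# Hypothetical helper: keep n sentences, skip n, keep n, and so on.
keep_skip <- function(sentences, n) {
  sentences[ceiling(seq_along(sentences) / n) %% 2 == 1]
}

# Made-up input: two rows with three sentences each.
text <- c("A. B. C.", "D. E. F.")

# Pattern applied to the collapsed text (as in the solution above).
collapsed <- keep_skip(
  unlist(strsplit(paste0(text, collapse = " "), "(?<=\\.)\\s+", perl = TRUE)),
  n
)

# Pattern restarted for every row.
per_row <- unlist(lapply(
  strsplit(text, "(?<=\\.)\\s+", perl = TRUE),
  keep_skip,
  n = n
))

collapsed
per_row
```

If the goal is strictly the first sentences of each row, the per-row form (or head) is the safer choice; the collapsed form is useful when the text should be treated as one continuous document.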