How to subset text by two sentences in R?


r dataframe


I have the following data frame:

df = data.frame(Text = c("This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.", "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.", "SO is great. You can get many things solve. Additional paragraph."), stringsAsFactors = F)

I use the following to split the text into sentences:

library(textshape)

split_sentence(df$Text)

However, I would like to subset the "Text" column so that only the first two sentences of each row are kept, giving a list like this:

This is great.
A really great place to be.
Good morning.
There are very skilled programmers here. 
SO is great.
You can get many things solve.

Can anyone help me?

Thanks

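You can split Text into separate rows, one per sentence, and keep only the first two sentences from each original row. Note that the separator consumes the period, so the sentences come back without their trailing ".". Using dplyr (with tidyr for separate_rows), you can do: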

library(dplyr)

df %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(Text, sep = '\\.\\s*') %>%
  group_by(row) %>%
  slice(1:2) %>%
  ungroup %>%
  select(-row)

#  Text                                   
#  <chr>                                  
#1 This is great                          
#2 A really great place to be             
#3 Good morning                           
#4 There are very skilled programmers here
#5 SO is great                            
#6 You can get many things solve


Another option, with strsplit and head:

unlist(lapply(strsplit(df$Text, '(?<=\\.)\\s*', perl = TRUE), head, 2))
# [1] "This is great."                           "A really great place to be."             
# [3] "Good morning."                            "There are very skilled programmers here."
# [5] "SO is great."                             "You can get many things solve."    
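
The strsplit/head idea above can be wrapped in a small function so that the number of sentences is a parameter and the originating row is kept alongside each sentence. This is only a sketch: the names first_n_sentences, per_row and result are hypothetical and not part of the original answer, and the pattern assumes sentences end with a period followed by whitespace.

```r
# Hypothetical helper (not from the original answer): keep the first n
# sentences of each element of a character vector.
first_n_sentences <- function(text, n = 2) {
  # Split on whitespace that follows a period; the lookbehind keeps the
  # period attached to its sentence.
  parts <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)
  lapply(parts, head, n)
}

df <- data.frame(
  Text = c("This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.",
           "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.",
           "SO is great. You can get many things solve. Additional paragraph."),
  stringsAsFactors = FALSE
)

per_row <- first_n_sentences(df$Text, n = 2)

# Long format: one sentence per line, with the index of the source row.
result <- data.frame(
  row  = rep(seq_along(per_row), lengths(per_row)),
  Text = unlist(per_row),
  stringsAsFactors = FALSE
)
result
```

Returning a list from the helper keeps the per-row grouping available; unlist flattens it only at the point where a plain vector or a long data frame is wanted.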


Base R solution. Note that this solution allows n to be set to any integer and follows that integer in a keep/skip pattern:
# Number of sentences to keep before removing the same number of sentences: n => integer scalar 
n <- 2

# Split the string into separate sentences: sentences => list of a character vector
res <- subset(data.frame(sentences = unlist(strsplit(paste0(df$Text, collapse = " "), "(?<=\\.)\\s+", perl = TRUE))),
                    ceiling(seq_along(sentences) / n) %% 2 == 1)[ , 1, drop = TRUE]

# Print the result to console: character vector => stdout (console)
res

# Data: 
df = data.frame(Text = c("This is great. A really great place to be. For sure if you wanna solve R issues. Skilled people.", "Good morning. There are very skilled programmers here. They will help sorting this. I am sure.", "SO is great. You can get many things solve. Additional paragraph."), stringsAsFactors = F)
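
Because the code above pastes all rows into one string before splitting, its keep/skip pattern runs across row boundaries. It matches the per-row result for this data only because the first two rows each contain exactly 2n sentences. A variant that restarts the pattern for every row might look like the sketch below. The function name keep_skip and the toy input are made up for illustration and do not come from the original answer.

```r
# Hypothetical helper: keep n sentences, skip n, keep n, ... within one row.
keep_skip <- function(sentences, n = 2) {
  # Assign sentences to consecutive blocks of size n and keep the
  # odd-numbered blocks (1st, 3rd, ...).
  block <- ceiling(seq_along(sentences) / n)
  sentences[block %% 2 == 1]
}

# Made-up input with 5 and 3 sentences, so rows are not multiples of 2 * n.
text <- c("One. Two. Three. Four. Five.", "Six. Seven. Eight.")

# Split each row separately, then apply the pattern row by row.
per_row <- lapply(strsplit(text, "(?<=\\.)\\s+", perl = TRUE), keep_skip, n = 2)
per_row

# For comparison: the pattern applied to the concatenated text.
pooled <- keep_skip(unlist(strsplit(paste0(text, collapse = " "),
                                    "(?<=\\.)\\s+", perl = TRUE)), n = 2)
pooled
```

With this input the per-row version keeps "Six." and "Seven." from the second row, while the pooled version keeps "Five." and "Six." as one block and skips "Seven.", which shows how the two approaches diverge.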
