How do I select the longest n-gram among repeated strings in R?
Tags: r, string, dataframe, substring, gsub


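I have a dataset similar to the following one (just with many more rows):

As you can see, there is a clear pattern: one word at a time is appended to the previous n-gram, until the base changes and the process starts over. Let me take the first "block" as an example: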

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"

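For each "block" like the one above, I want to keep only the longest sentence/n-gram. In the case above, I would keep only the seventh row. Doing this for every block, I would get: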

    
 [1] "abov level consist price stabil protract period time"           
 [2] "abov level consist price stabil sinc last autumn"    
 [3] "abov level consist price stabil some time"                                              
 [4] "abov over come month"                                      
 [5] "abov precis level depend futur energi price develop"

Can anyone help me with this?

Thanks.

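You can count the number of characters in each string and keep every string that is followed by a shorter one, plus the last string, which always closes its block.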

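# diff(nchar(x)) < 0 marks each string that is followed by a shorter one;
# the last string always closes its block, so its index is appended.
inds <- c(which(diff(nchar(x)) < 0), length(x))
x[inds]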

#[1] "abov level consist price stabil protract period time"
#[2] "abov level consist price stabil sinc last autumn"    
#[3] "abov level consist price stabil some time"           
#[4] "abov over come month"                                
#[5] "abov precis level depend futur energi price develop" 
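
The length comparison above relies on every new block starting with a string shorter than the one before it. If that is not guaranteed, a prefix check is one possible alternative: keep a string unless the next string extends it by another word. Below is a minimal sketch, assuming the data is a character vector. The sample vector x is illustrative (its second block is reconstructed from the pattern described in the question), and nxt and extends are names introduced here.

```r
# Illustrative input: the start of the first block from the question,
# followed by an assumed second block built the same way.
x <- c("abov level",
       "abov level consist",
       "abov level consist price",
       "abov over",
       "abov over come",
       "abov over come month")

# The string that follows each element (NA after the last one).
nxt <- c(x[-1], NA_character_)

# TRUE where the next string is the current string plus at least one more word.
extends <- startsWith(nxt, paste0(x, " "))

# Keep a string when nothing extends it; the last string yields NA and is kept.
keep <- is.na(extends) | !extends
x[keep]
#[1] "abov level consist price" "abov over come month"
```

Appending a space to the prefix avoids treating "abov lev" as the start of "abov level".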

We can also use filter() together with lead() from dplyr:

library(dplyr)
tibble(x) %>%
     filter((nchar(lead(x, default = last(x))) - nchar(x)) <= 0)
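
When the n-grams sit in a data frame next to other columns, another option is to label each block explicitly and take the longest row per block. The following is a rough sketch, assuming dplyr 1.0.0 or later (for slice_max()). The block column and the sample data are introduced here for illustration, and block boundaries are again inferred from a drop in character count.

```r
library(dplyr)

# Illustrative input: two short blocks following the pattern in the question.
x <- c("abov level",
       "abov level consist",
       "abov level consist price",
       "abov over",
       "abov over come",
       "abov over come month")

longest <- tibble(x) %>%
  # A new block starts at the first row and wherever the string gets shorter.
  mutate(block = cumsum(c(TRUE, diff(nchar(x)) < 0))) %>%
  group_by(block) %>%
  # Keep the single longest string in each block.
  slice_max(nchar(x), n = 1, with_ties = FALSE) %>%
  ungroup()

longest$x
#[1] "abov level consist price" "abov over come month"
```

Keeping the block column makes it easy to join the result back to other per-block data.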

If this question is rejected here, it would arguably be on topic for Code Golf!