How do I select the longest ngram among repeated strings in R?
I have a dataset similar to the one below (just with more rows). As you can see, there is a clear pattern: one word at a time is appended to the previous ngram, until the base changes and the process starts over. Let me take the first "block" as an example:
[1] "abov level"
[2] "abov level consist"
[3] "abov level consist price"
[4] "abov level consist price stabil"
[5] "abov level consist price stabil protract"
[6] "abov level consist price stabil protract period"
[7] "abov level consist price stabil protract period time"
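For each such "block", I want to keep only the longest sentence/ngram. In the case above, I would keep only the seventh line. Doing this for every block, I would get: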
[1] "abov level consist price stabil protract period time"
[2] "abov level consist price stabil sinc last autumn"
[3] "abov level consist price stabil some time"
[4] "abov over come month"
[5] "abov precis level depend futur energi price develop"
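Can anyone help me?
Thanks

You can count the number of characters in each string and keep the positions where the next string has fewer characters than the current one, plus the last element: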
inds <- c(which(diff(nchar(x)) < 0), length(x))
x[inds]
#[1] "abov level consist price stabil protract period time"
#[2] "abov level consist price stabil sinc last autumn"
#[3] "abov level consist price stabil some time"
#[4] "abov over come month"
#[5] "abov precis level depend futur energi price develop"
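To try this on a small input, here is a minimal self-contained sketch. The vector `x` is a hypothetical sample built from two shortened blocks of the question's example, not the original data. The `startsWith()` variant is an alternative I am adding as an assumption, not part of the answer above: it keeps a string unless the next string extends it by another word, so it does not depend on a new block starting with a shorter string.

```r
# Hypothetical sample data: two shortened blocks from the question's example
x <- c(
  "abov level",
  "abov level consist",
  "abov level consist price",
  "abov over",
  "abov over come",
  "abov over come month"
)

# Approach from the answer: keep positions where the next string is shorter,
# plus the final element
inds <- c(which(diff(nchar(x)) < 0), length(x))
by_length <- x[inds]

# Alternative sketch (assumption): a string is "extended" when the next string
# starts with it followed by a space; the last string is never extended
n <- length(x)
extended <- c(startsWith(x[-1], paste0(x[-n], " ")), FALSE)
by_prefix <- x[!extended]

print(by_length)
print(by_prefix)
```

On this sample both approaches keep the third and sixth strings. They can differ if a new block begins with a string that is at least as long as the last string of the previous block.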
We can use filter and lead from dplyr:
library(dplyr)
tibble(x) %>%
filter((nchar(lead(x, default = last(x))) - nchar(x)) <= 0)
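As a runnable illustration of the pipeline above, here is a small sketch, assuming dplyr is installed. The sample vector `x` is hypothetical (two shortened blocks from the question), and `pull()` is only added here to get a plain character vector back.

```r
library(dplyr)

# Hypothetical sample data: two shortened blocks from the question's example
x <- c(
  "abov level",
  "abov level consist",
  "abov level consist price",
  "abov over",
  "abov over come",
  "abov over come month"
)

# lead() shifts x by one position; default = last(x) makes the difference for
# the final row equal to 0, so the `<= 0` condition keeps it
result <- tibble(x) %>%
  filter((nchar(lead(x, default = last(x))) - nchar(x)) <= 0) %>%
  pull(x)

print(result)
```

Because the condition is `<= 0` rather than `< 0`, a row whose successor has exactly the same number of characters is also kept, which is a small difference from the `diff(nchar(x)) < 0` version.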
If this question gets voted down here, it is arguably a good fit for Code Golf!