R 检查非唯一字符的字符串模式

R 检查非唯一字符的字符串模式,r,string,strsplit,R,String,Strsplit,我有一个包含两列的数据框:id和gradelist 成绩列表列中的值包括一个长度不同的成绩列表(由;分隔) 以下是数据: id <- seq(1,7) gradelist <- c("a;b;b", "c;c", "d;d;d;f", "f;f;f;f;f;f", "a;a;a;a", "f;b;b;b;b;b;b;b", "c;c;d;d;

我有一个包含两列的数据框:
id
gradelist

成绩列表
列中的值包括一个长度不同的成绩列表(由
分隔)

以下是数据:

id <- seq(1,7)
gradelist <- c("a;b;b",
            "c;c",
            "d;d;d;f",
            "f;f;f;f;f;f",
            "a;a;a;a",
            "f;b;b;b;b;b;b;b",
            "c;c;d;d;a;a")

df <- data.frame(id, gradelist)
df$gradelist <- as.character(df$gradelist)

id我们可以提取字符,并使用
n_distinct
检查以发现不同元素的数量为1

library(dplyr)
library(purrr)
df %>% 
   mutate(same = map_chr(str_extract_all(gradelist, "[a-z]"), 
       ~ c("no", "yes")[1+(n_distinct(.x)==1)]))
#   id       gradelist same
#1  1           a;b;b   no
#2  2             c;c  yes
#3  3         d;d;d;f   no
#4  4     f;f;f;f;f;f  yes
#5  5         a;a;a;a  yes
#6  6 f;b;b;b;b;b;b;b   no
#7  7     c;c;d;d;a;a   no
或在

df %>% 
   mutate(same = map_chr(str_extract_all(gradelist, "[a-z]"), ~
         case_when(n_distinct(.x) == 1 ~ "yes", TRUE ~ "no")))

或者另一个选项是“成绩表”上的
separate_行
,要展开数据,请查找
n_distinct

library(tidyr)
df %>% 
    separate_rows(gradelist) %>%
    distinct %>% 
    group_by(id) %>% 
    summarise(same = c("no", "yes")[1 + (n_distinct(gradelist) == 1)]) %>% 
    left_join(df)
尝试:

输出:

  id       gradelist same
1  1           a;b;b   No
2  2             c;c  Yes
3  3         d;d;d;f   No
4  4     f;f;f;f;f;f  Yes
5  5         a;a;a;a  Yes
6  6 f;b;b;b;b;b;b;b   No
7  7     c;c;d;d;a;a   No
您也可以按如下方式选择strsplit:

transform(df, same = c('No', 'Yes')[sapply(strsplit(gradelist, split = ';'), function(x) length(unique(unlist(x))) == 1) + 1])

基准

我们将字符串重复几次。我们还重复
df
的行,这样我们最终得到略多于100k行,并分配@ThomasIsCoding使用的函数

df$gradelist <- sapply(df$gradelist, function(x) paste(replicate(20, x), collapse = ";"))

df <- df[rep(seq_len(nrow(df)), each = 15000), ]

f <- Vectorize(function(x) ifelse(length(unique(unlist(strsplit(x,";"))))==1,"yes","no"))
结果:

Unit: seconds
            expr       min        lq      mean    median        uq       max neval
          akrun1 19.684781 19.912789 21.084244 20.646490 21.606763 24.008420    10
          akrun2 30.393006 31.066965 32.590679 31.824528 33.567449 37.780535    10
          akrun3  6.378463  7.190472  7.379439  7.373730  7.704365  8.321929    10
              db  3.738271  3.785858  3.935769  3.911479  3.926385  4.523876    10
             M--  3.551592  3.648720  3.723315  3.741075  3.798664  3.915588    10
 ThomasIsCoding1  4.453528  4.498858  4.702160  4.613088  4.823517  5.379984    10
 ThomasIsCoding2  3.368358  3.532593  3.752111  3.610664  3.773345  4.969414    10
    arg0naut91_1  1.638212  1.683986  1.699327  1.704614  1.716077  1.759059    10
    arg0naut91_2  3.665604  3.739662  3.774542  3.750144  3.774753  4.071887    10
绘图:


检查哪个字符位于第一位,并用空字符串替换该字符的所有出现处。如果没有留下任何内容,则表示所有字符都相同

sapply(df$gradelist, function(x) {
    nchar(gsub(paste0(substring(x, 1, 1), "|;"), "", x)) == 0
}, USE.NAMES = FALSE)
#[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
df$相同id等级列表相同
#>1 a;Bb不
#>2 c;c是的
#>3d;DDf否
#>4 f;FFFFf是的
#>5 a;A.A.a是的
#>6 f;BBBBBBb不
#>7 c;CDDA.a不

以下是一些基本的R解决方案

  • 定义自定义函数
    f
    ,即
  • 使用
    regmatches
    +
    sapply

这些方法当然可以按原样工作,但为了提高可读性,您可以使用
if\u else
,而不是子集一个yes/no向量。在
map
summary
中,函数/表达式将变成
if_else(n_distinct(.x)=1,“yes”,“no”)
我不记得是不是你,但以前有一个问题需要检查字符是否相同,是否出现了ssomebody。与基准
sapply(df$gradelist,函数(x)长度(unique(charToRaw(x))@akrun不是我。不确定这在r中是否重要,但
df感谢@M-,你确实是对的-没有时间进行适当的采样,但是根据我的经验,我相信我们会看到类似的排名(至少在“原生”r解决方案中)。
Unit: seconds
            expr       min        lq      mean    median        uq       max neval
          akrun1 19.684781 19.912789 21.084244 20.646490 21.606763 24.008420    10
          akrun2 30.393006 31.066965 32.590679 31.824528 33.567449 37.780535    10
          akrun3  6.378463  7.190472  7.379439  7.373730  7.704365  8.321929    10
              db  3.738271  3.785858  3.935769  3.911479  3.926385  4.523876    10
             M--  3.551592  3.648720  3.723315  3.741075  3.798664  3.915588    10
 ThomasIsCoding1  4.453528  4.498858  4.702160  4.613088  4.823517  5.379984    10
 ThomasIsCoding2  3.368358  3.532593  3.752111  3.610664  3.773345  4.969414    10
    arg0naut91_1  1.638212  1.683986  1.699327  1.704614  1.716077  1.759059    10
    arg0naut91_2  3.665604  3.739662  3.774542  3.750144  3.774753  4.071887    10
sapply(df$gradelist, function(x) {
    nchar(gsub(paste0(substring(x, 1, 1), "|;"), "", x)) == 0
}, USE.NAMES = FALSE)
#[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
f <- Vectorize(function(x) ifelse(length(unique(unlist(strsplit(x,";"))))==1,"yes","no"))
df$same <- f(df$gradelist)
df <- within(df,same <- sapply(regmatches(gradelist,gregexpr("\\w",gradelist)),function(x) ifelse(length(unique(x))==1,"yes","no")))
> df
  id       gradelist same
1  1           a;b;b   no
2  2             c;c  yes
3  3         d;d;d;f   no
4  4     f;f;f;f;f;f  yes
5  5         a;a;a;a  yes
6  6 f;b;b;b;b;b;b;b   no
7  7     c;c;d;d;a;a   no