R: how to count the number of "," in each column


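I have a strange question. I have some sentences, and I want to count how many "," each one contains. The new variable number should equal the number of "," plus 1. How can I do that? It should look something like this: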

The sample data can be generated with the following code:

df<-structure(list(Outcome = c("Happy, New", "Year, to, you", "this", 
"is, a , very", "strange, question")), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

It is easier to count the words with str_count:

library(stringr)
library(dplyr)
df %>% 
    mutate(Number = str_count(Outcome, "\\w+"))
Output:

# A tibble: 5 x 2
#  Outcome           Number
#  <chr>              <int>
#1 Happy, New             2
#2 Year, to, you          3
#3 this                   1
#4 is, a , very           3
#5 strange, question      2

Or, in base R, use strsplit with lengths:

df$Number <- lengths(strsplit(df$Outcome, ",\\s*"))
df$Number

# Remove every character except the commas, then count what is left
nchar(gsub('[^,]', '', df$Outcome)) + 1
#[1] 2 3 1 3 2
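
The splitting and comma-counting strategies above agree on the sample data, but they can differ on edge cases such as a trailing comma or an empty string. A small base R sketch of that difference, using a made-up vector x that is not part of the question's data:

```r
# Hypothetical inputs: a normal row, a trailing comma, and an empty string
x <- c("a, b", "a,", "")

# Comma count + 1: every comma adds a field, even a trailing one
by_commas <- nchar(gsub("[^,]", "", x)) + 1

# strsplit drops a trailing empty piece and returns character(0) for ""
by_split <- lengths(strsplit(x, ",\\s*"))

data.frame(x, by_commas, by_split)
```

Here by_commas comes out as 2, 2, 1 and by_split as 2, 1, 0, so which one is "right" depends on whether an empty trailing element should count.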

Another base R option is to use lengths + gregexpr, e.g.:

transform(
  df,
  Number = lengths(gregexpr("\\w+", Outcome))
)

            Outcome Number
1        Happy, New      2
2     Year, to, you      3
3              this      1
4      is, a , very      3
5 strange, question      2
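
One caveat with lengths(gregexpr(...)): when a string contains no match at all, gregexpr returns a single -1, so the length is still 1. A hedged sketch of a variant that goes through regmatches, which returns an empty vector for such strings. The vector x is illustrative only:

```r
# Illustrative inputs, including an empty string with no words in it
x <- c("Happy, New", "this", "")

# gregexpr yields -1 for "no match", which still has length 1
naive <- lengths(gregexpr("\\w+", x))

# regmatches extracts the matches themselves, so no match means length 0
safe <- lengths(regmatches(x, gregexpr("\\w+", x)))

data.frame(x, naive, safe)
```

On the question's data both versions give the same answer. The difference only matters if some rows may be empty.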

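The count.fields function in base R is used by functions such as read.table to determine how many columns the resulting data.frame needs. You can use it here as well, although count.fields is designed to work with a file or a connection: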

count.fields(textConnection(df$Outcome), ",")
# [1] 2 3 1 3 2
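
Because count.fields was written for reading files, it applies quote and comment-character handling by default, and it does not close a connection that was already open when passed in. A minimal sketch of a wrapper that accounts for both. The helper name count_fields_chr is hypothetical, not something from the original answer:

```r
# Hypothetical helper: count sep-delimited fields in each element of x
count_fields_chr <- function(x, sep = ",") {
  con <- textConnection(x)
  # count.fields leaves an already-open connection open, so close it here
  on.exit(close(con))
  # Disable quote and comment handling so ' " and # are read as plain text
  count.fields(con, sep = sep, quote = "", comment.char = "")
}

res <- count_fields_chr(c("Happy, New", "Year, to, you", "this"))
res
```

Note that count.fields skips blank lines by default (blank.lines.skip = TRUE), so empty strings in the input would shorten the result relative to the input vector.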
Since count.fields is so widely used, it is fairly efficient. However, if you are dealing with very large input, you may want to use stri_count_fixed from the "stringi" package instead.

Here are some benchmarks:

fun_cf <- function(x = df$Outcome) count.fields(textConnection(x), ",")
fun_gs <- function(x = df$Outcome) nchar(gsub('[^,]', '', x)) + 1
fun_sc <- function(x = df$Outcome) stringr::str_count(x, ",") + 1
fun_ss <- function(x = df$Outcome) lengths(strsplit(x, ",", TRUE))
fun_scf <- function(x = df$Outcome) stringi::stri_count_fixed(x, ",") + 1

string <- rep(c(df$Outcome, paste(df$Outcome, df$Outcome, sep = ",")), 1e5)
length(string)
# [1] 1000000

bench::mark(fun_cf(string), fun_gs(string), fun_sc(string),
            fun_ss(string), fun_scf(string))
## # A tibble: 5 x 13
##   expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 fun_cf(string)  792.64ms 792.64ms     1.26     11.6MB     0        1     0
## 2 fun_gs(string)     5.28s    5.28s     0.189    19.1MB     0        1     0
## 3 fun_sc(string)  840.17ms 840.17ms     1.19     11.4MB     1.19     1     1
## 4 fun_ss(string)  830.35ms 830.35ms     1.20     11.4MB     0        1     0
## 5 fun_scf(string) 154.86ms 155.44ms     6.24     11.4MB     1.56     4     1
## # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## #   time <list>, gc <list>

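What if the pieces between the "," are not single words but two or three words, something like "this is, a strange, question"?
@Stataq You can count the "," instead:
df %>% mutate(Number = str_count(Outcome, ",") + 1)
or, if there are spaces as well:
df %>% mutate(Number = str_count(Outcome, "[, ]+") + 1)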
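
To make the point raised in the comments concrete: a word-based pattern and a comma-based count diverge as soon as an element holds more than one word. A small base R sketch on an illustrative string (the exact wording of the commenter's example is an assumption):

```r
# Illustrative input: the first string has multi-word elements
x <- c("this is, a strange, question", "Happy, New")

# Counts words, so multi-word elements inflate the result
words <- lengths(regmatches(x, gregexpr("\\w+", x)))

# Counts comma-separated elements, however many words each one holds
elements <- lengths(strsplit(x, ",", fixed = TRUE))

data.frame(x, words, elements)
```

The first string has 5 words but 3 comma-separated elements, which is why counting the commas and adding 1 is the safer choice for this question.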