R: how to count the number of "," in each column
I have a strange question. I have some sentences, and I want to count how many "," there are in each sentence. A new variable, number, should equal the number of "," plus 1. How can I do this?
The sample data can be generated with the following code:
df<-structure(list(Outcome = c("Happy, New", "Year, to, you", "this",
"is, a , very", "strange, question")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
It is easier to count the words using str_count:
library(stringr)
library(dplyr)
df %>%
mutate(Number = str_count(Outcome, "\\w+"))
Output:
# A tibble: 5 x 2
#  Outcome           Number
#  <chr>              <int>
#1 Happy, New             2
#2 Year, to, you          3
#3 this                   1
#4 is, a , very           3
#5 strange, question      2
Or use strsplit and lengths in base R:
df$Number <- lengths(strsplit(df$Outcome, ",\\s*"))
# remove all characters except commas and count what is left
nchar(gsub('[^,]', '', df$Outcome)) + 1
#[1] 2 3 1 3 2
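One caveat worth checking: the strsplit-based count and the comma count plus one can disagree when a string ends with a comma, because strsplit does not produce a trailing empty piece. A minimal sketch, using a made-up vector x rather than the data from the question:

```r
# Hypothetical input: the second element ends with a comma
x <- c("a, b", "a, b,")

# Splitting on commas: strsplit() drops the empty piece after a trailing comma
n_split <- lengths(strsplit(x, ",", fixed = TRUE))

# Counting the commas directly and adding one
n_commas <- nchar(gsub("[^,]", "", x)) + 1

n_split   # 2 2
n_commas  # 2 3
```

Which of the two answers is wanted depends on whether a trailing comma should count as starting another (empty) item.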
Another base R option is to use lengths + gregexpr, e.g.,
transform(
df,
Number = lengths(gregexpr("\\w+", Outcome))
)
which gives
            Outcome Number
1        Happy, New      2
2     Year, to, you      3
3              this      1
4      is, a , very      3
5 strange, question      2
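Note that gregexpr signals "no match" with a single -1, so lengths(gregexpr(...)) reports 1 even for a string without any word characters. A small sketch of one way around this, with a made-up input vector x; regmatches is part of base R:

```r
# Hypothetical input: the second element contains no word characters
x <- c("Happy, New", ", ,")

# gregexpr() returns -1 for "no match", which still has length 1
n_naive <- lengths(gregexpr("\\w+", x))

# regmatches() extracts the actual matches, so "no match" has length 0
n_words <- lengths(regmatches(x, gregexpr("\\w+", x)))

n_naive  # 2 1
n_words  # 2 0
```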
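The count.fields function in base R is used in functions such as read.table to determine how many columns the resulting data.frame needs. You can use it here as well, although count.fields is designed to work on files or connections: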
count.fields(textConnection(df$Outcome), ",")
# [1] 2 3 1 3 2
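Because count.fields parses its input like a data file, its defaults treat quote characters and # specially, and a textConnection stays open until it is closed. The following is only a sketch of a wrapper that addresses both; count_fields_chr is a hypothetical helper name, not something from the answer:

```r
# Hypothetical helper (not from the answer): count comma-separated fields
# in a character vector via count.fields()
count_fields_chr <- function(x, sep = ",") {
  con <- textConnection(x)
  on.exit(close(con))  # release the connection explicitly
  # quote = "" and comment.char = "" make ' " and # plain text
  count.fields(con, sep = sep, quote = "", comment.char = "")
}

count_fields_chr(c("it's, fine", "a, b # c, d"))
# [1] 2 3
```

Blank strings are still skipped under the default blank.lines.skip = TRUE, so the result can be shorter than the input if the vector contains empty elements.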
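Given that it is a workhorse function that is used all the time, it performs quite efficiently. However, if you are dealing with a very large number of strings, you might want to use stri_count_fixed from the "stringi" package.
Here are some tests: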
fun_cf <- function(x = df$Outcome) count.fields(textConnection(x), ",")
fun_gs <- function(x = df$Outcome) nchar(gsub('[^,]', '', x)) + 1
fun_sc <- function(x = df$Outcome) stringr::str_count(x, ",") + 1
fun_ss <- function(x = df$Outcome) lengths(strsplit(x, ",", TRUE))
fun_scf <- function(x = df$Outcome) stringi::stri_count_fixed(x, ",") + 1
string <- rep(c(df$Outcome, paste(df$Outcome, df$Outcome, sep = ",")), 1e5)
length(string)
# [1] 1000000
bench::mark(fun_cf(string), fun_gs(string), fun_sc(string),
fun_ss(string), fun_scf(string))
## # A tibble: 5 x 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
## 1 fun_cf(string) 792.64ms 792.64ms 1.26 11.6MB 0 1 0
## 2 fun_gs(string) 5.28s 5.28s 0.189 19.1MB 0 1 0
## 3 fun_sc(string) 840.17ms 840.17ms 1.19 11.4MB 1.19 1 1
## 4 fun_ss(string) 830.35ms 830.35ms 1.20 11.4MB 0 1 0
## 5 fun_scf(string) 154.86ms 155.44ms 6.24 11.4MB 1.56 4 1
## # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## # time <list>, gc <list>
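The timings are only comparable if the functions return the same counts; bench::mark checks this by default (check = TRUE). As a rough base-R sketch of the same check, leaving out the stringr and stringi variants so that it runs without extra packages, and using the sample strings from the question:

```r
# Sample strings from the question
outcome <- c("Happy, New", "Year, to, you", "this",
             "is, a , very", "strange, question")

con <- textConnection(outcome)
n_cf <- count.fields(con, ",")                         # count.fields approach
close(con)
n_gs <- nchar(gsub("[^,]", "", outcome)) + 1           # gsub + nchar approach
n_ss <- lengths(strsplit(outcome, ",", fixed = TRUE))  # strsplit approach

# all.equal() ignores the integer/double difference between the results
stopifnot(isTRUE(all.equal(n_cf, n_gs)), isTRUE(all.equal(n_gs, n_ss)))
n_cf
# [1] 2 3 1 3 2
```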
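What if what is between the "," is not a single word but two or three words, something like "this is a weird question"?
@Stataq You can count the ",", i.e. df %>% mutate(Number = str_count(Outcome, ",") + 1),
or, if there are spaces as well, df %>% mutate(Number = str_count(Outcome, "[, ]+") + 1)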
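The concern raised in the comment can be seen in a small base-R sketch: with several words between commas, counting words and counting commas give different answers. The vector x is made up for illustration:

```r
# Hypothetical input with several words between commas
x <- c("this is a weird, question", "Happy, New")

# Counting words overcounts when an item has more than one word
n_words <- lengths(regmatches(x, gregexpr("\\w+", x)))

# Counting commas and adding one counts the comma-separated items
n_items <- nchar(gsub("[^,]", "", x)) + 1

n_words  # 5 2
n_items  # 2 2
```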