R 高效地计算多列中字符串的出现次数
我有一个大型数据框(>400万行),其中包含存储字符串的列R 高效地计算多列中字符串的出现次数,r,dataframe,R,Dataframe,我有一个大型数据框(>400万行),其中包含存储字符串的列yname1,yname2,yname3: yname1 | yname2 | yname3 aaaaaa | bbbaaa | bbaaaa aaabbb | cccccc | aaaaaa aaaaaa | aaabbb | dddddd cccccc | dddddd | eeeeee 现在我想计算所有列中每个字符串出现的总数。这些应作为附加列添加: yname1 | yname2 | yname3 | rcount1 | rco
yname1
,yname2
,yname3
:
yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee
现在我想计算所有列中每个字符串出现的总数。这些应作为附加列添加:
yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa | 3 | 1 | 1
aaabbb | cccccc | aaaaaa | 2 | 2 | 3
aaaaaa | aaabbb | dddddd | 3 | 2 | 2
cccccc | dddddd | eeeeee | 2 | 2 | 1
我已经编写了以下代码,可以完成这项工作:
data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount1我认为找到每个唯一值的和然后连接到原始表会更快
all_yname <-c(df$yname1, df$yname2, df$yname3)
rcount <- as.data.frame(table(all_yname))
merge(df, rcount, by.x = 'yname1', by.y = 'all_yname')
all_-yname一个数据如何。表
方法:
library(data.table)
setDT(d)
lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
# value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1
d[, paste0("rcount", 1:3) :=
lapply(d, function(x) lookup[x, , on = .(value)][, N])]
# yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa 3 1 1
#2: aaabbb cccccc aaaaaa 2 2 3
#3: aaaaaa aaabbb dddddd 3 2 2
#4: cccccc dddddd eeeeee 2 2 1
在base R中,可以构建data.frame的未列出值的表,并根据这些值对它们进行索引。确保索引所依据的是字符串,而不是因子(因此为.character
),否则它将被编号而不是名称索引
data.frame(df,
lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
# yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
如果data.frame足够大,速度较慢,则可以在lappy
之外构建表,使其只运行一次:
df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))
资料
df我更喜欢上面的答案,但为了完整起见,让我添加一个备选方案,它基于使用唯一字符串作为行名:
df2 <- melt(df, id.vars = NULL)
df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame()
rownames(df2) <- df2$value
df2$value <- NULL
这个解决方案现在要快得多,尽管不如某些替代方案快。看
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df), espresso(df), times=1000);
Unit: microseconds
expr min lq mean median uq max neval
bgoldst(df) 579.447 673.956 739.9614 713.2980 759.0550 3719.153 1000
markus(df) 549.514 630.123 681.1892 655.1390 679.0870 3767.048 1000
alistaire(df) 1662.650 1796.287 1957.4346 1851.8795 1921.5840 26532.692 1000
jota(dt) 5551.147 5897.745 6333.6954 6041.8590 6283.6880 22457.746 1000
mhairi(df) 2538.450 2717.843 2990.8535 2793.1070 2910.9205 65752.067 1000
coffein(df) 1636.565 1858.936 2006.7821 1941.2555 2016.7330 4553.044 1000
espresso(df) 753.496 825.766 910.6520 865.5365 925.4055 4662.091 1000
已经有一些很好的解决方案,但是没有一个使用match()
在预先计算的频率表中查找每个字符串。下面是如何做到这一点。注意,我选择了as.matrix()
来为table()
的参数和match()
的第一个参数生成yname*
列的矩阵
可以替换为
cnts[m]
因此,根本不需要调用match()
我只是重新运行了基准测试,发现它并没有以任何显著的方式改变我的解决方案的运行时间(可能只是在小规模测试中稍微加快)。这大概是因为为具有字符名的向量编制索引在内部需要相同类型的match()
逻辑,因此上述替换不会获得任何性能。但我想说,在简洁性和简单性方面的改进是值得的
标杆管理
我应该注意到,为了产生这些基准测试结果,我对其他一些解决方案做了一些小的修改。最值得注意的是,我希望避免为重复执行复制任何输入,但由于data.tables是通过引用传递的,因此我必须修改jota()
,使其成为幂等的。这涉及到只过滤目标yname*
列,我通过grep()
调用将其预计算成名为cns
的局部变量,就像我在自己的解决方案中所做的那样。为了公平起见,我在所有解决方案中添加了相同的grep()
调用和过滤逻辑,除了markus()
,它不需要它,因为它显式地分别处理每个列。我还将jota()
中的lookup
上的索引联接操作更改为lookup[(value=x),'value']
,因为它在其他方面对我不起作用。最后,对于mhairi()
,我通过在所有yname*
列中添加一个Reduce()
调用来合并来完成解决方案
library(microbenchmark);
library(data.table);
library(dplyr);
bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
库(microbenchmark);
图书馆(数据表);
图书馆(dplyr);
bgoldst是的,如果我不跨多个列计数,这是一种简单的方法。例如:在代码中,您只计算列yname1=>2中出现的“aaaaaa”,但我想计算所有列=>3中出现的“aaaaaa”。啊,好吧!我想把所有的栏都连接起来是最容易的。请稍等,我会修改我的回答。我觉得很好。它可能是R版本的东西;它是一个表的子集,我认为是在3.3.0中添加/更改的。可以将其转换为命名向量,但需要重新排列。为了比较性能,应该使用更多的行。OP提到了超过400万行。因此,要比较解决方案,您应该将行数从4行放大到4e3到4e6范围内的某个位置。这一点很好!我看到您已经在答案中添加了一个具有适当行数的基准,而且似乎意式浓缩咖啡
仍然比咖啡
快,尽管仅从我的基准来看,这可能并不明显。“仅在开发版本中有效”应该是data.table的官方标记行
# df[] <- lapply(df, as.character) # in case they are stored as factors
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]
> df
yname1 yname2 yname3 r1 r2 r3
1 aaaaaa bbbaaa bbaaaa 3 1 1
2 aaabbb cccccc aaaaaa 2 2 3
3 aaaaaa aaabbb dddddd 3 2 2
4 cccccc dddddd eeeeee 2 2 1
df2 <- data.frame(table(unlist(df)), row.names = 1)
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]
espresso <- function(df) {
cns <- grep(value=T,'^yname',names(df));
df2 <- data.frame(table(unlist(df[cns])), row.names = 1)
df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df
};
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df), espresso(df), times=1000);
Unit: microseconds
expr min lq mean median uq max neval
bgoldst(df) 579.447 673.956 739.9614 713.2980 759.0550 3719.153 1000
markus(df) 549.514 630.123 681.1892 655.1390 679.0870 3767.048 1000
alistaire(df) 1662.650 1796.287 1957.4346 1851.8795 1921.5840 26532.692 1000
jota(dt) 5551.147 5897.745 6333.6954 6041.8590 6283.6880 22457.746 1000
mhairi(df) 2538.450 2717.843 2990.8535 2793.1070 2910.9205 65752.067 1000
coffein(df) 1636.565 1858.936 2006.7821 1941.2555 2016.7330 4553.044 1000
espresso(df) 753.496 825.766 910.6520 865.5365 925.4055 4662.091 1000
cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
## yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa 3 1 1
## 2 aaabbb cccccc aaaaaa 2 2 3
## 3 aaaaaa aaabbb dddddd 3 2 2
## 4 cccccc dddddd eeeeee 2 2 1
cnts[match(m,names(cnts))]
cnts[m]
library(microbenchmark);
library(data.table);
library(dplyr);
bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(df) 491.373 544.6165 597.4743 575.8350 609.192 2054.872 100
## markus(df) 375.907 435.5645 463.7258 467.4250 489.022 549.962 100
## alistaire(df) 754.380 816.1755 849.8749 840.3385 888.021 959.654 100
## jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930 100
## mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045 100
## coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441 100
## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(df) 85.20766 87.00487 88.39154 88.80209 89.98348 91.16487 3
## markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390 3
## alistaire(df) 83.03348 83.10276 83.18116 83.17204 83.25500 83.33797 3
## jota(dt) 12.49174 13.82088 14.44939 15.15002 15.42821 15.70640 3
## mhairi(df) 156.06459 156.36608 158.27256 156.66758 159.37654 162.08551 3
## coffein(df) 154.02853 154.97215 156.52246 155.91576 157.76942 159.62309 3