Speeding up SI/metric prefix conversion with R data.table
Here is the situation. I have a table with 85 million rows and 18 columns. Three of the columns hold values in metric prefix/SI notation (see Wikipedia). This means I have numbers such as:
- .1M instead of 100000 or 1e+5, or
- 1K instead of 1000 or 1e+3
A sample of the data.table:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1: 2014-03-25 12:15:12 58300 3010 44.0 4.5 0.0 0 0 0.8 50 0.8 10K 303 21K 0 a 56
2: 2014-03-25 12:15:12 56328 3010 28.0 12.0 0.0 0 0 0.3 60 0.0 59 62 .1M 0 a 66
3: 2014-03-25 12:15:12 21082 3010 10.0 1.7 0.0 0 0 14.0 72 0.3 4K 208 8K 1 a 80
4: 2014-03-25 12:15:12 59423 3010 12.0 0.0 0.2 0 0 88.0 0 0.0 20 16 71 0 a 26
5: 2014-03-25 12:15:12 59423 3010 9.6 1.4 0.0 0 0 60.0 29 0.2 2K 251 6K 0 a 56
6: 2014-03-25 12:15:12 24193 3010 8.3 1.9 0.0 0 0 9.9 80 0.3 3K 264 8K 1 a 71
7: 2014-03-25 12:15:12 21082 3010 7.1 1.7 0.4 0 0 6.3 83 0.3 3K 197 7K 0 a 71
8: 2014-03-25 12:15:12 59423 3010 4.6 1.2 0.0 0 0 57.0 37 0.1 998 81 7K 0 a 118
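I adapted a function written by Hans-Jörg Bibiko, which he uses to reformat ggplot2 scales. If you are interested, have a look at his website. The function I ended up using is: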
sitor <- function(x)
{
    # exponent strings keyed by SI prefix, e.g. conv["K"] is "E3"
    conv <- paste("E", c(seq(-24 ,-3, by=3), -2, -1, 0, seq(3, 24, by=3)), sep="")
    names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","","K","M","G","T","P","E","Z","Y")
    x <- as.character(x)
    # convert a single string: numeric part pasted to the exponent string
    num <- function(x) as.numeric(
        paste(
            # numeric part is the first piece before any prefix letter
            strsplit(x,"[A-z|µ]")[[1]][1],
            ifelse(substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1) == "",
                "",
                conv[substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1)]
            ),
            sep=""
        )
    )
    # applied element by element, returning a list
    return(lapply(x,num))
}
I have set a key on the data.table with
setkeyv(temp,c("V1","V2","V3","V18"))
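After 61 minutes I am still waiting for the result... Given that my data volume will grow 4 to 5 times, some hints on how to speed up the conversion would be very helpful.

Why not try the sitools library?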
library(data.table)
dt<-data.table(var = sample(x=1:1e5, size=1e6, replace=T))
library(sitools)
> system.time(dt[, var2 := f2si(var)])
user system elapsed
10.08 0.09 10.89
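Here is an approach (the f_conv function listed further down) that takes about 10 seconds on my computer to convert a vector with 10 million values. You can extend it beyond "K", "M" and "G".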
Output from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4878 neurozen 20 0 18.7g 18g 11m R 100.1 62.8 63:38.95 rsession
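What is it that runs for 61 minutes: setkeyv(temp, c("V1","V2","V3","V18")) or temp[, `:=`(V13=sitor(V13), V14=sitor(V14), V15=sitor(V15))]? And why do you sort temp?
sitor returns a list... are your columns in the dt of type list?
@Dave Applying the sitor function to temp with temp[, `:=`(V13=sitor(V13), V14=sitor(V14), V15=sitor(V15))] takes 61 minutes (in fact it is still running). I did not intend to sort temp; that is just what happens once the key is set. setkeyv on temp takes about 90 seconds.
@Michele Good point. The columns I am applying sitor to are of class character. Are you suggesting that I should not use lapply for the function's return value? Tell me more...
Wow, awesome. I will give it a try!
I cannot install the scitools package for my R version: 'scitools' is not available (for R version 3.0.3)
@neurozen it's sitools
@Michelle Thanks heaps. I have been up too long. The f2si function works well, but I actually want to do the opposite. As in my sample data above, which has K for kilo (1e3) and M for mega (1e6), I want the value as a number (with an exponent or otherwise).
I will give it a try @Michelle
@Data Munger FYI, I also thought of using grep to select only the rows that contain metric units: temp[grep("[KMG]$", V1), V1 := sitor(V1)]. The problem is that you have to return character and then run as.numeric over the column again.
I tried it and it works well. Running it on 3 columns of a data.table with 83.7 million rows took 237.6 seconds. Thanks a lot.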
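The grep idea from the comments can be sketched roughly as follows. This is only an illustration on a toy table: kmg_to_num is a hypothetical helper (not from sitools or from the thread), it handles only the K/M/G suffixes, and it writes into a new numeric column V13_num so that the character column never has to be converted a second time.

```r
library(data.table)

# Hypothetical helper: expects strings that end in K, M or G.
kmg_to_num <- function(x) {
  mult <- c(K = 1e3, M = 1e6, G = 1e9)
  n <- nchar(x)
  # numeric part times the multiplier looked up by the last character
  as.numeric(substring(x, 1L, n - 1L)) * unname(mult[substring(x, n, n)])
}

# Toy stand-in for temp, using values from the sample rows.
temp <- data.table(V13 = c("10K", "59", "4K", "20", "998"))

# Plain numbers convert directly; suffixed values become NA for now.
temp[, V13_num := suppressWarnings(as.numeric(V13))]

# Only the rows that carry a suffix go through the helper.
temp[grepl("[KMG]$", V13), V13_num := kmg_to_num(V13)]
```

Because the helper works on whole vectors and returns a plain numeric vector, no per-element lapply is involved.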
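# Reverse conversion: SI-prefixed strings (as produced by f2si) back to numeric
si2f<-function(x){
    if(is.numeric(x)) return(x)
    require(data.table)
    # lookup table of prefix -> multiplier, keyed on the prefix
    dt<-data.table(lab=c("y","z","a","f","p","n","µ","m","c","d","", "da", "h", "k","M","G","T","P","E","Z","Y"),
                   mul=c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e-2, 1e-1, 1L, 10L, 1e2, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24),
                   key="lab")
    # numeric part: drop everything that is not a digit or a decimal point
    res<-as.numeric(gsub("[^0-9|\\.]","", x))
    # prefix part: drop digits, whitespace and decimal points
    x<-gsub("[0-9]|\\s+|\\.","", x)
    # keyed join to fetch the multiplier for every element at once
    .subset2(dt[.(x)], "mul")*res
}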
> system.time(dt[, var3 := si2f(var2)])
user system elapsed
13.18 0.03 13.31
> dt[, all.equal(var,var3)]
[1] TRUE
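For the question's use case (three character columns converted in one pass), a lookup-based converter can be applied to several columns through .SDcols. The following is a sketch under assumptions: si_lookup is a hypothetical name, it covers only the upper-case K/M/G suffixes seen in the sample data, and the toy table stands in for temp.

```r
library(data.table)

# Hypothetical converter: a plain multiplier vector plus match().
si_lookup <- function(x) {
  if (is.numeric(x)) return(x)
  mult <- c(1, 1e3, 1e6, 1e9)
  names(mult) <- c("", "K", "M", "G")
  value  <- as.numeric(gsub("[^0-9.]", "", x))    # keep digits and the decimal point
  prefix <- gsub("[0-9.[:space:]]", "", x)        # what remains is the prefix, possibly ""
  # match() returns positions, so the empty prefix is found as well
  value * unname(mult[match(prefix, names(mult))])
}

# Toy stand-in for temp, using values from the first two sample rows.
temp <- data.table(V13 = c("10K", "59"),
                   V14 = c("303", "62"),
                   V15 = c("21K", ".1M"))

cols <- c("V13", "V14", "V15")
# Replace the character columns by numeric ones in a single call.
temp[, (cols) := lapply(.SD, si_lookup), .SDcols = cols]
```

match() is used here rather than indexing by name because an empty string used as a character index does not match any name in R.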
> f_conv <- function(val){
+ # create matrix indexed by name for exponent
+ key <- c(Zero = ""
+ , K = "E3"
+ , M = "E6"
+ , G = "E9"
+ )
+ # extract where the original exponent is
+ indx <- regexpr("[KMG]", val)
+ # extract the exponent
+ exp <- substring(val, indx)
+ # if there was none, then use "Zero"
+ exp[indx == -1L] <- "Zero"
+ # put fake length
+ indx[indx == -1L] <- 20L
+ # do the conversion
+ as.numeric(paste0(substring(val, 1L, indx - 1L)
+ , key[exp]
+ )
+ )
+ }
>
> # test data
> n <- 10000000
> result <- paste0(sample(1:999, n, TRUE)
+ , sample(c("K", "M", "G", ""), n, TRUE)
+ )
>
> system.time(x <- f_conv(result))
user system elapsed
8.48 0.13 8.63
> cbind(result[1:50], x[1:50])
[,1] [,2]
[1,] "562K" "562000"
[2,] "946" "946"
[3,] "313G" "313000000000"
[4,] "538M" "538000000"
[5,] "697K" "697000"
[6,] "486G" "486000000000"
[7,] "814G" "814000000000"
[8,] "842" "842"
[9,] "993M" "993000000"
[10,] "440K" "440000"
[11,] "435G" "435000000000"
[12,] "407M" "407000000"
[13,] "919K" "919000"
[14,] "840" "840"
[15,] "766G" "766000000000"
[16,] "977" "977"
[17,] "139" "139"
[18,] "195G" "195000000000"
[19,] "609M" "609000000"
[20,] "69" "69"
[21,] "147M" "147000000"
[22,] "104M" "104000000"
[23,] "509K" "509000"
[24,] "951M" "951000000"
[25,] "278" "278"
[26,] "797G" "797000000000"
[27,] "106K" "106000"
[28,] "667K" "667000"
[29,] "521K" "521000"
[30,] "9" "9"
[31,] "17K" "17000"
[32,] "673M" "673000000"
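As mentioned, the same idea can be extended beyond "K", "M" and "G". A minimal sketch, assuming only upper-case suffixes are needed: f_conv_ext is a hypothetical name, and the added "T" and "P" entries are illustrative. "E" (exa) is deliberately left out because it would collide with the exponent marker that as.numeric() parses.

```r
# Hypothetical extension of the regexpr/paste0 approach to more suffixes.
f_conv_ext <- function(val) {
  # suffix letter -> exponent string understood by as.numeric()
  key <- c(Zero = "", K = "E3", M = "E6", G = "E9", T = "E12", P = "E15")
  val  <- as.character(val)
  indx <- as.integer(regexpr("[KMGTP]", val))   # position of the suffix, -1 if none
  none <- indx == -1L
  suffix <- substring(val, indx, indx)          # the single suffix letter
  suffix[none] <- "Zero"                        # no suffix: append nothing
  indx[none] <- nchar(val)[none] + 1L           # keep the whole string
  as.numeric(paste0(substring(val, 1L, indx - 1L), unname(key[suffix])))
}

x <- f_conv_ext(c("10K", ".1M", "998", "2T"))
```

Using nchar(val) + 1 for values without a suffix avoids relying on a fixed "fake" length.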