R 将data.table列快速连接到一个字符串列中_R_Data.table_Concatenation

R 将data.table列快速连接到一个字符串列中

R 将data.table列快速连接到一个字符串列中,r,data.table,concatenation,R,Data.table,Concatenation,给定data.table中的任意列名列表，我想将这些列的内容连接到存储在新列中的单个字符串中。我需要连接的列并不总是相同的，因此我需要动态地生成表达式我暗自怀疑，我使用eval（parse（…）调用的方式可能会被更优雅的东西所取代，但下面的方法是迄今为止我能够实现的最快的方法对于1000万行，这个示例数据大约需要21.7秒（baserpaste0需要稍长一点——23.6秒）。我的实际数据有18-20列被连接起来，最多有1亿行，因此减速变得更加不切实际有什么想法可以加快速度吗？现行方法

给定

data.table

中的任意列名列表，我想将这些列的内容连接到存储在新列中的单个字符串中。我需要连接的列并不总是相同的，因此我需要动态地生成表达式

我暗自怀疑，我使用

eval（parse（…）

调用的方式可能会被更优雅的东西所取代，但下面的方法是迄今为止我能够实现的最快的方法

对于1000万行，这个示例数据大约需要21.7秒（baser

paste0

需要稍长一点——23.6秒）。我的实际数据有18-20列被连接起来，最多有1亿行，因此减速变得更加不切实际

有什么想法可以加快速度吗？

现行方法然后用于将列与以下表达式连接起来：

DT[,State := eval(parse(text = PasteStatement))]

输出样本：

     x   y a b c d e f        State
1: foo bar 4 8 3 6 9 2 foo483692bar
2: foo bar 8 4 8 7 8 4 foo848784bar
3: foo bar 2 6 2 4 3 5 foo262435bar
4: foo bar 2 4 2 4 9 9 foo242499bar
5: foo bar 5 9 8 7 2 7 foo598727bar

分析结果

更新1:

fread

、

fwrite

和

sed

按照@Gregor的建议，尝试使用

sed

在磁盘上进行连接。多亏了data.table的极快的

fread

和

fwrite

函数，我能够将列写入磁盘，使用sed消除逗号分隔符，然后在大约18.3秒的时间内读回后处理输出——速度还不足以进行切换，但仍然是一个有趣的切线

ConcatCols <- c("x","a","b","c","d","e","f","y") fwrite(DT[,..ConcatCols],"/home/xxx/DT.csv") system("sed 's/,//g' /home/xxx/DT.csv > /home/xxx/DT_Post.csv ") Post <- fread("/home/xxx/DT_Post.csv") DT[,State := Post[[1]]]
生成的语句如下所示：

> cat(stri_joinStatement) stri_join(a,b,c,d,e,f,a,b,c,d,e,f,a,b,c,d,e,f, sep="", collapse=NULL, ignore_null=TRUE) > cat(sprintfStatement) sprintf('%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i', a,b,c,d,e,f,a,b,c,d,e,f,a,b,c,d,e,f)

更新3:
R
不一定要慢。根据@Martin Modrák的回答，我根据一些
数据整理了一个小马驹包。表专门用于专门的“一位数整数”案例的内部结构：fastConcat 。（短期内不要在CRAN上寻找它，但通过从github repo安装，您可以自担风险使用它。）如果有人能更好地理解c ，这一点可能会得到进一步的改进，但现在，它在2.5秒内运行与更新2中相同的情况，大约8x比sprintf（）快，11.5x比stringi:：stri_c（）快我最初使用的方法对我来说，这突出了在R 中一些最简单的操作（如使用更好的c 的基本字符串向量串联）上提高性能的巨大机会。我想像@Matt Dowle这样的人已经看到这一点很多年了——如果他有时间重新编写所有的R ，而不仅仅是data.frame就好了 C救援从data.table中窃取一些代码，我们可以编写一个工作速度更快的C函数（并且可以并行化，甚至更快）首先确保你有一个工作的C++工具链： library(inline) fx <- inline::cfunction( signature(x = "integer", y = "numeric" ) , ' return ScalarReal( INTEGER(x)[0] * REAL(y)[0] ) ; ' ) fx( 2L, 5 ) #Should return 10 我不知道样本数据对于您的实际数据有多大的代表性，但是对于您的样本数据，您可以通过只连接一次而不是多次concatcol的每个唯一组合来实现显著的性能改进这意味着对于样本数据，如果你也做了所有的重复，你会看到大约500k个串联，而不是1000万个串联请参阅以下代码和计时示例： system.time({ setkeyv(DT, ConcatCols) DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT)) DTunique[, State := do.call(paste, c(DTunique, sep = ""))] DT[DTunique, State := i.State, on = ConcatCols] }) # user system elapsed # 7.448 0.462 4.618 这使用包tidyr 中的unite 。可能不是最快的，但可能比手工编码的R码快 library(tidyr) system.time( DNew <- DT %>% unite(State, ConcatCols, sep = "", remove = FALSE) ) # user system elapsed # 14.974 0.183 15.343 DNew[1:10] # State x y a b c d e f # 1: foo211621bar foo bar 2 1 1 6 2 1 # 2: foo532735bar foo bar 5 3 2 7 3 5 # 3: foo965776bar foo bar 9 6 5 7 7 6 # 4: foo221284bar foo bar 2 2 1 2 8 4 # 5: foo485976bar foo bar 4 8 5 9 7 6 # 6: foo566778bar foo bar 5 6 6 7 7 8 # 7: foo892636bar foo bar 8 9 2 6 3 6 # 8: foo836672bar foo bar 8 3 6 6 7 2 # 9: foo963926bar foo bar 9 6 3 9 2 6 # 10: foo385216bar foo bar 3 8 5 2 1 6 library（tidyr）系统时间( DNew%unite（州政府，美国国家石油公司，sep=“”，删除=FALSE） ) #用户系统运行时间 # 14.974 0.183 15.343 DNew[1:10] #州x y a b c d e f #1:foo211621bar foo bar 2 1 6 2 1 #2:foo532735bar foo bar 5 3 2 7 3 5 #3:foo965776巴foo巴9 6 5 7 6 #4:foo221284bar foo bar 2 2 1 2 8 4 #5:foo485976巴foo巴4 8 5 9 7 6 #6:Foo566778巴FooBar 5 6 7 8 #7:foo892636bar foo bar 8 9 2 6 3 6 #8:foo836672bar foo bar 8 3 6 6 7 2 #9:foo963926bar foo bar 9 6 3 9 2 6 #10:Foo385216酒吧FooBar3 8 5 2 1 6 <代码> > <代码> > Srutyc >确实是一个C++函数来连接字符串。我不认为你能在R中击败它的性能。即使是paste 也能很快地进入编译代码，因此它的性能几乎一样好。也许你可以使用命令行工具对数据进行预处理或后处理？或者以SQL或Hadoop或以任何方式加载数据？有几种想法：（a）在从Hadoop提取数据时合并列。Hive、Pig和Spark都支持列串联（据我所知）。（b）不幸的是，fread 不允许使用空分隔符，但readr:：write_delim 将允许使用空分隔符。这可能太慢了，但值得一试。（c） sed 可能是命令行中最快的一种，但是建议您可以使用不同的语法来提高速度，特别是如果您复制文件而不是就地编辑文件的话。（d）不知道这是否可行，但是在fwrite 中，它看起来像是一行输入检查，阻止您将“ 指定为分隔符。您可以尝试使用fixInNamespace 删除该行，然后查看它是否允许您使用sep=”“ 编写。我以前从未使用过fixInNamespace ，但这应该是可以做到的。悬而未决的问题是sep 不是空字符串是否有更深层次的原因。请提交FR以支持sep=“” imo。感谢您所做的一切努力！按原样运行您的代码我得到了8.5秒的运行时间，使用sprintf（）将运行速度从基线20.5秒提高了2.4倍。我目前正试图一行一行地工作，试图了解每一件作品都在做什么，但似乎这里有一些真正坚实的潜力！我可以试试 library(inline) fx <- inline::cfunction( signature(x = "integer", y = "numeric" ) , ' return ScalarReal( INTEGER(x)[0] * REAL(y)[0] ) ; ' ) fx( 2L, 5 ) #Should return 10 library(inline) library(data.table) library(stringi) header <- " //Taken from https://github.com/Rdatatable/data.table/blob/master/src/fwrite.c static inline void reverse(char *upp, char *low) { upp--; while (upp>low) { char tmp = *upp; *upp = *low; *low = tmp; upp--; low++; } } void writeInt32(int *col, size_t row, char **pch) { char *ch = *pch; int x = col[row]; if (x == INT_MIN) { *ch++ = 'N'; *ch++ = 'A'; } else { if (x<0) { *ch++ = '-'; x=-x; } // Avoid log() for speed. Write backwards then reverse when we know how long. char *low = ch; do { *ch++ = '0'+x%10; x/=10; } while (x>0); reverse(ch, low); } *pch = ch; } //end of copied code " worker_fun <- inline::cfunction( signature(x = "list", preallocated_target = "character", columns = "integer", start_row = "integer", end_row = "integer"), includes = header , " const size_t _start_row = INTEGER(start_row)[0] - 1; const size_t _end_row = INTEGER(end_row)[0]; const int max_out_len = 256 * 256; //max length of the final string char buffer[max_out_len]; const size_t num_elements = _end_row - _start_row; const size_t num_columns = LENGTH(columns); const int * _columns = INTEGER(columns); for(size_t i = _start_row; i < _end_row; ++i) { char *buf_pos = buffer; for(size_t c = 0; c < num_columns; ++c) { if(c > 0) { buf_pos[0] = ','; ++buf_pos; } writeInt32(INTEGER(VECTOR_ELT(x, _columns[c] - 1)), i, &buf_pos); } SET_STRING_ELT(preallocated_target,i, mkCharLen(buffer, buf_pos - buffer)); } return preallocated_target; " ) #Test with the same data RowCount <- 5e6 DT <- data.table(x = "foo", y = "bar", a = sample.int(9, RowCount, TRUE), b = sample.int(9, RowCount, TRUE), c = sample.int(9, RowCount, TRUE), d = sample.int(9, RowCount, TRUE), e = sample.int(9, RowCount, TRUE), f = sample.int(9, RowCount, TRUE)) ## Generate an expression to paste an arbitrary list of columns together ConcatCols <- list("a","b","c","d","e","f") ## Do it 3x as many times ConcatCols <- c(ConcatCols,ConcatCols,ConcatCols) ptm <- proc.time() preallocated_target <- character(RowCount) column_indices <- sapply(ConcatCols, FUN = function(x) { which(colnames(DT) == x )}) x <- worker_fun(DT, preallocated_target, column_indices, as.integer(1), as.integer(RowCount)) DT[, State := preallocated_target] proc.time() - ptm no_cores <- detectCores() # Initiate cluster cl <- makeCluster(no_cores) #Preallocated target and prepare params num_elements <- length(DT[[1]]) preallocated_target <- character(num_elements) block_size <- 4096 #No of rows processed at once. Adjust for best performance column_indices <- sapply(ConcatCols, FUN = function(x) { which(colnames(DT) == x )}) num_blocks <- ceiling(num_elements / block_size) clusterExport(cl, c("DT","preallocated_target","column_indices","num_elements", "block_size")) clusterEvalQ(cl, <CODE TO LOAD THE NATIVE FUNCTION HERE>) parLapply(cl, 1:num_blocks , function(block_id) { throw_away <- worker_fun(DT, preallocated_target, columns, (block_id - 1) * block_size + 1, min(num_elements, block_id * block_size - 1)) return(NULL) }) stopCluster(cl) system.time({ setkeyv(DT, ConcatCols) DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT)) DTunique[, State := do.call(paste, c(DTunique, sep = ""))] DT[DTunique, State := i.State, on = ConcatCols] }) # user system elapsed # 7.448 0.462 4.618 setkeyv(DT, ConcatCols) system.time({ DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT)) DTunique[, State := do.call(paste, c(DTunique, sep = ""))] DT[DTunique, State := i.State, on = ConcatCols] }) # user system elapsed # 2.526 0.280 2.181 library(tidyr) system.time( DNew <- DT %>% unite(State, ConcatCols, sep = "", remove = FALSE) ) # user system elapsed # 14.974 0.183 15.343 DNew[1:10] # State x y a b c d e f # 1: foo211621bar foo bar 2 1 1 6 2 1 # 2: foo532735bar foo bar 5 3 2 7 3 5 # 3: foo965776bar foo bar 9 6 5 7 7 6 # 4: foo221284bar foo bar 2 2 1 2 8 4 # 5: foo485976bar foo bar 4 8 5 9 7 6 # 6: foo566778bar foo bar 5 6 6 7 7 8 # 7: foo892636bar foo bar 8 9 2 6 3 6 # 8: foo836672bar foo bar 8 3 6 6 7 2 # 9: foo963926bar foo bar 9 6 3 9 2 6 # 10: foo385216bar foo bar 3 8 5 2 1 6