Concatenating varying numbers of columns by partial match (R)
Tags: r, data.table, concatenation, multiple-columns


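First question, though I've been lurking for a while! I've tried to do my due diligence and am getting close to an answer.

I have a data frame with 300 columns that I'd like to consolidate into about 10 columns, based on matching a pattern in the variable names. The raw data output gives me columns named with the main variable name (in the example, "before" and "after") followed by a number. In my "real" data there are about 30 copies of each variable.

I want to concatenate every column that has "before", "after", etc. in its name. I succeeded in creating the variable "new" using data.table's syntax for this kind of "computed" column.

But as you can see, this explicitly states the columns I want to combine. I'd like the combination to be flexible, so that if one variable has 31 copies and another has 86, I don't have to (a) know that or (b) type them out. I just want to match on the base variable name (e.g., "before") and concatenate the columns.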
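The explicit version described above might look like the sketch below. The original code for "new" is not shown here, so this is an assumption about its shape; the table itself is reconstructed from the output printed in the answer further down (the herenow values are the rounded ones shown there).

```r
library(data.table)

# Example table reconstructed from the printed output in the answer (assumption).
myTable2 <- data.table(
  herenow = c(0.3399679, 0.8181909, 0.2237681, 0.6161998, 0.7606252, 0.5525105),
  before1 = c("if", "for", "and", "and", "fifth", "and"),
  before2 = c("and", "in", "where", "where", "eighth", "where"),
  before3 = c("where", "by", "", "", "and", "not"),
  after1  = c("not", "through", "mine", "ha", "where", "fill"),
  after2  = c("here", "blank", "yours", "hey", "not", ""),
  after3  = c("blank", "blank", "ours", "hon", "beet", "are")
)

# Explicit approach: every column to be combined has to be typed out by name.
myTable2[, new := paste(before1, before2, before3, sep = "")]
```

With 30 or more copies per variable, the list of names inside paste() is exactly the part that becomes impractical to maintain by hand.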

I tried to take it to the next level using grep:

> newvar2 <- paste(grep("before", colnames(myTable2), value = TRUE), collapse = "")
> newvar2
[1] "before1before2before3"

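Is there a way to use the grep step as the argument, so that all columns whose names match the pattern are combined? This is what I want: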

 herenow        before_Final    after_Final
1: 0.339967856  ifandwhere      nothereblank
2: 0.818190875  forinby         throughblankblank
3: 0.223768051  andwhere        mineyoursours
4: 0.616199835  andwhere        haheyhon
5: 0.760625218  fiftheighthand  wherenotbeet
6: 0.552510532  andwherenot     fillare

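I'm learning more about vectorization, but it would be great if I could list the kinds of variables I want to combine (e.g., before, after, between) and then run through them in a loop! Something like: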

finalVarNames <- c("Before_final", "After_final", "Between_final")
whatToMatch <- c("before", "after", "between")

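I know the syntax is not correct, probably in the second "myTable2" reference, just before the value argument. This code does successfully create the new variables, but they are empty. How do I get the concatenated groups of grep-matched columns into them?

Thanks for any help you can give.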

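You can use the Reduce function to paste the selected columns together, specifying the columns with grep inside the .SD syntax. Here is an example that gets the result using the data.table package: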
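library(stringi); library(data.table)
# For each pattern, paste together all columns whose names match it and
# assign the results to new columns named "<Pattern>_final".
myTable2[, paste(stri_trans_totitle(whatToMatch), "final", sep = "_") := 
           lapply(whatToMatch, function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), 
                                             .SD[, grep(wtm, names(myTable2)), with = FALSE]))]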

myTable2
#      herenow before1 before2 before3  after1 after2 after3   Before_final       After_final
# 1: 0.3399679      if     and   where     not   here  blank     ifandwhere      nothereblank
# 2: 0.8181909     for      in      by through  blank  blank        forinby throughblankblank
# 3: 0.2237681     and   where            mine  yours   ours       andwhere     mineyoursours
# 4: 0.6161998     and   where              ha    hey    hon       andwhere          haheyhon
# 5: 0.7606252   fifth  eighth     and   where    not   beet fiftheighthand      wherenotbeet
# 6: 0.5525105     and   where     not    fill           are    andwherenot          fillare
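The loop form asked for in the question could be sketched as follows. This is only one possible variant, not the code used above: it assumes every source column is named as the base name followed by digits (hence the hypothetical pattern "^before[0-9]+$"), it uses .SDcols with do.call(paste0, .SD) in place of Reduce, and it runs on a cut-down three-row version of the example table.

```r
library(data.table)

# Cut-down version of the example table (first three rows only).
myTable2 <- data.table(
  herenow = c(0.3399679, 0.8181909, 0.2237681),
  before1 = c("if", "for", "and"),
  before2 = c("and", "in", "where"),
  before3 = c("where", "by", ""),
  after1  = c("not", "through", "mine"),
  after2  = c("here", "blank", "yours"),
  after3  = c("blank", "blank", "ours")
)

finalVarNames <- c("Before_final", "After_final", "Between_final")
whatToMatch   <- c("before", "after", "between")

for (i in seq_along(whatToMatch)) {
  # Column NAMES matching the base name plus a number (assumed naming scheme).
  cols <- grep(paste0("^", whatToMatch[i], "[0-9]+$"), names(myTable2), value = TRUE)
  if (length(cols) == 0L) next  # e.g. no "between" columns in this table
  # Parentheses make data.table evaluate the left-hand side to get the new name;
  # .SD then holds only the matched columns, which are pasted row by row.
  myTable2[, (finalVarNames[i]) := do.call(paste0, .SD), .SDcols = cols]
}
```

The key difference from the attempt in the question is that grep is applied to names(myTable2) to pick columns, and the column values (not the names) are then handed to paste0.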

Some benchmarks of do.call and Reduce:
dim(myTable2)
# [1] 1572864       9

reduce <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), .SD[, grep(wtm, names(myTable2)), with = F]))]    
docall <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) do.call(paste, c(sep = "", .SD[, grep(wtm, names(myTable2)), with = F])))]

microbenchmark::microbenchmark(docall(), reduce(), times = 10)
# Unit: milliseconds
#     expr      min        lq      mean    median        uq       max neval
# docall() 707.7818  722.6037  767.8923  737.6272  852.4909  868.8202    10
# reduce() 999.4925 1009.5146 1026.6200 1020.4637 1046.7073 1067.7479    10
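How the benchmark input was built is not shown, so the following is only a stand-in under stated assumptions: the row count 1572864 equals 6 * 2^18, and the table has seven columns before the two "_final" columns are added. The words and the object name big are hypothetical. The sketch also checks that the two approaches return identical strings, which is worth confirming before comparing timings.

```r
library(data.table)

set.seed(1)        # reproducible stand-in data
n <- 6 * 2^18      # 1572864 rows, the row count reported by dim(myTable2)
words <- c("if", "and", "where", "not", "here", "blank", "")

# Hypothetical benchmark input: one numeric column plus six character columns.
big <- data.table(herenow = runif(n))
for (nm in c(paste0("before", 1:3), paste0("after", 1:3))) {
  big[, (nm) := sample(words, n, replace = TRUE)]
}

# Both approaches should give the same result for the "before" columns.
before_cols <- as.list(big[, grep("before", names(big)), with = FALSE])
via_reduce  <- Reduce(function(x, y) paste(x, y, sep = ""), before_cols)
via_docall  <- do.call(paste, c(before_cols, sep = ""))
identical(via_reduce, via_docall)
```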

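As a starting point, see do.call(paste, c(sep = "", myTable2[startsWith(names(myTable2), whatToMatch[i])])).

c() doesn't take a sep= argument; the documentation says the only option is "recursive". c() should make a list, right? I split out startsWith(names(myTable2), whatToMatch[1]) to test it, and it gave me a logical vector indicating, in this case, whether each column name starts with "before". Then, when I wrapped it in myTable2[ ], it gave me only the first 3 rows of data, with all variables intact. More intuitive than grep, IMO.

Correction to the above comment: when I wrapped it in myTable2[ ], it gave me only rows 2:4 of the data, with all variables intact. My guess is that it uses the TRUE values as the index for subsetting.

sep= is a named argument passed to c through "..."; e.g. c(sep = "", a = 2, '1 != 2' = TRUE, fac = factor(1)) returns a named "character" vector containing the "..." arguments, with the tags of "..." as its "names". The subsetting you observed is, I guess, because you are subsetting a "data.table" rather than a "data.frame" with a "logical" vector. To see what c(sep = "", <a subset of myTable2>), which is what gets passed to do.call, is doing, try converting myTable2 to a "data.frame". You could also add the "data.table" tag if you need a specific "data.table" approach.

Thanks @alexis_laz. Trying it as a data.frame concatenated the right group of columns! Now it looks like I'll have to use a data.frame approach together with a data.table approach to assign the column. (I could go either way; I had just heard that fread might be faster with larger files.)

I think Reduce(paste, ) is unnecessarily inefficient compared with its equivalent do.call(paste, ), because with Reduce all the intermediate "character" vectors are scanned and copied repeatedly until the final "character" vector is created. Reduce works like paste(paste(paste(x, y), z), ...), while do.call creates and evaluates a single paste(x, y, z) call. The former has to (1) cache, (2) scan, and (3) copy all the intermediate "character" results, while do.call allocates the appropriate buffer once and then concatenates all the elements. Also, given the number of columns mentioned in the Q, the difference in a benchmark is more pronounced, as in x = rep_len(list(rep_len(letters, 1e5)), 50); identical(Reduce(paste, x), do.call(paste, x)); microbenchmark::microbenchmark(Reduce(paste, x), do.call(paste, x), times = 25).

For benchmark input, it's usually best to show the code that generates it rather than only its dimensions.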
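The difference between the two kinds of subsetting discussed in these comments can be seen in a small sketch. The two-row table and the object names (df, dt, sel) are hypothetical; selecting data.table columns by position with with = FALSE is just one of several ways to do it.

```r
library(data.table)

df <- data.frame(
  herenow = c(0.1, 0.2),
  before1 = c("if", "for"),
  before2 = c("and", "in"),
  after1  = c("not", "through"),
  stringsAsFactors = FALSE
)

# Logical vector over the column names: FALSE TRUE TRUE FALSE
sel <- startsWith(names(df), "before")

# On a data.frame, single-bracket subsetting with a logical vector picks COLUMNS.
# c() then builds a plain list whose first element is named "sep".
args_df <- c(sep = "", df[sel])
res_df  <- do.call(paste, args_df)   # same as paste(sep = "", before1, before2)

# On a data.table, dt[sel] is interpreted as a ROW subset, so the columns are
# selected through j instead, here by position.
dt <- as.data.table(df)
res_dt <- do.call(paste, c(sep = "", as.list(dt[, which(sel), with = FALSE])))
```

Because the first element of the list is named sep, do.call matches it to paste's sep argument and the remaining elements are treated as the vectors to concatenate.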
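The assignment attempted in the question (indexed by i), which creates the new variables but leaves them empty:
myTable2[, finalVarNames[i] := paste(grep(whatToMatch[i], myTable2, value = TRUE), collapse = "")]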
