Repeat rows three times and replace titles according to predetermined formats in R


I don't think I explained this well in the title, so here is an example. I have more than 10K records that look like this:

library(data.table)

Data <- data.table(
  Time= sample(1:50),
  Values = sample(1:50),
  Locations= sample(c("PlaceA","PlaceB","PlaceC"),50 , replace= TRUE),
  TitlesFormat1= sample(c("TitleA", "TitleB","TitleC"), 50, replace = TRUE),
  key=c("TitlesFormat1,Time")
)

Data$TitlesFormat2<-paste0(Data$TitlesFormat1,"_(topic)")

Data$TitlesFormat3<-paste0(Data$TitlesFormat1,"_(",Data$Locations,"_topic)")

head(Data)

 Time Values Locations TitlesFormat1   TitlesFormat2      TitlesFormat3
   2     49    PlaceC      TitleA     TitleA_(topic) TitleA_(PlaceC_topic)
   6     41    PlaceA      TitleA     TitleA_(topic) TitleA_(PlaceA_topic)
   8     40    PlaceA      TitleA     TitleA_(topic) TitleA_(PlaceA_topic)
  13     15    PlaceB      TitleA     TitleA_(topic) TitleA_(PlaceB_topic)
  14     11    PlaceC      TitleA     TitleA_(topic) TitleA_(PlaceC_topic)
  18     17    PlaceC      TitleA     TitleA_(topic) TitleA_(PlaceC_topic)

Any suggestions?
Thanks in advance for your help.

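Before we get to the answer, a couple of points:

  • If you use sample, call set.seed(.) first to ensure reproducibility.

  • The idiomatic way in data.table is to use the := operator to add/update columns by reference. Otherwise there is little benefit in using data.tables.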
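As a minimal sketch of both points applied to the question's setup: the seed value 42 is arbitrary and chosen only for illustration, and the column definitions are copied from the question.

```r
library(data.table)

set.seed(42)  # arbitrary seed (an assumption); any fixed value makes sample() repeatable

Data <- data.table(
  Time          = sample(1:50),
  Values        = sample(1:50),
  Locations     = sample(c("PlaceA", "PlaceB", "PlaceC"), 50, replace = TRUE),
  TitlesFormat1 = sample(c("TitleA", "TitleB", "TitleC"), 50, replace = TRUE),
  key = "TitlesFormat1,Time"
)

# Add both derived columns by reference in one call, instead of Data$... <- ...
Data[, `:=`(
  TitlesFormat2 = paste0(TitlesFormat1, "_(topic)"),
  TitlesFormat3 = paste0(TitlesFormat1, "_(", Locations, "_topic)")
)]
```

Because := modifies Data in place, there is no need to assign the result back to Data.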


You can use the melt.data.table function to convert the data.table to long format:

require(reshape2)
ans <- melt(Data, id=1:3)[, variable := NULL]
#      Time Values Locations                 value
#   1:    3     44    PlaceA                TitleA
#   2:    7     15    PlaceC                TitleA
#   3:   12      3    PlaceC                TitleA
#   4:   13      7    PlaceA                TitleA
#   5:   15     13    PlaceC                TitleA
#  ---                                            
# 146:   43     36    PlaceB TitleC_(PlaceB_topic)
# 147:   44     46    PlaceB TitleC_(PlaceB_topic)
# 148:   46      6    PlaceC TitleC_(PlaceC_topic)
# 149:   48     29    PlaceC TitleC_(PlaceC_topic)
# 150:   50     11    PlaceB TitleC_(PlaceB_topic)
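To make the reshaping easier to follow, here is a small sketch on a two-row table. The table `wide`, the helper `title_cols`, and the output column names `Format` and `Title` are assumptions made for illustration. It also relies on recent data.table versions exporting their own melt() method, so reshape2 is not loaded here.

```r
library(data.table)  # recent versions provide melt() for data.tables directly

# Toy input (hypothetical): two records, built like the question's data
wide <- data.table(
  Time          = 1:2,
  Values        = c(10L, 20L),
  Locations     = c("PlaceA", "PlaceB"),
  TitlesFormat1 = c("TitleA", "TitleB")
)
wide[, `:=`(
  TitlesFormat2 = paste0(TitlesFormat1, "_(topic)"),
  TitlesFormat3 = paste0(TitlesFormat1, "_(", Locations, "_topic)")
)]

# Name the columns explicitly rather than relying on positions 1:3
title_cols <- grep("^TitlesFormat", names(wide), value = TRUE)

long <- melt(
  wide,
  id.vars       = c("Time", "Values", "Locations"),
  measure.vars  = title_cols,
  variable.name = "Format",
  value.name    = "Title"
)

# Drop the column that records which format each row came from
long[, Format := NULL]

print(long)
```

Each input row appears three times in `long`, once per title format, which is the "repeat rows three times" effect asked for in the question.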

Following @Ananda's benchmarks, I realised that we can do this with a simple order() instead of using .SD here:
ans[order(Time, Values, Locations)]

order() inside a data.table is optimised to use data.table's fast ordering (from v1.9.3+), so this should be much faster than the earlier .SD version.
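A small sketch of why the ordering step matters: melt() stacks the title columns one after another, so the three variants of a record sit far apart until the rows are sorted. The toy table and the names `wide`, `long` and `sorted` are assumptions for illustration. setorder() is shown only as a by-reference alternative to ans[order(...)]; it is not the method used in the answer.

```r
library(data.table)

# Toy input (hypothetical), deliberately not sorted by Time
wide <- data.table(
  Time          = c(2L, 1L),
  Values        = c(20L, 10L),
  Locations     = c("PlaceB", "PlaceA"),
  TitlesFormat1 = c("TitleB", "TitleA")
)
wide[, `:=`(
  TitlesFormat2 = paste0(TitlesFormat1, "_(topic)"),
  TitlesFormat3 = paste0(TitlesFormat1, "_(", Locations, "_topic)")
)]

long <- melt(wide, id.vars = 1:3)[, variable := NULL]

# Straight after melt(), the rows are stacked one title column at a time
print(long$Time)

# order() inside [ ] returns a sorted copy and leaves `long` untouched
sorted <- long[order(Time, Values, Locations)]

# setorder() reorders `long` itself, by reference
setorder(long, Time, Values, Locations)

print(long)
```

After sorting, the three title variants of each record are adjacent.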

Here are the updated timings:

# Unit: milliseconds
#     expr        min         lq    median        uq       max neval
#  fun1a()  137.79719  154.68321  210.5660  242.4496  565.8980    50
#  fun2a()   92.80878   96.90226  139.4311  166.3089  472.6021    50
#  fun1b()  750.38312  828.79247  855.2852  940.3480 1151.7485    50
#  fun2b() 1059.37594 1238.60744 1332.6860 1417.6680 1502.5817    50
#  fun2c()  474.23736  543.14490  580.7551  623.6124  819.4660    50

where fun2c() is:

fun2c <- function() {
    melt(Data, id=1:3)[, variable := NULL][order(Time,Values,Locations)]
}

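Here is another alternative using "dplyr" + "tidyr". This is more for variety, but it also performs quite well. The benchmarks also show that rearranging the row order afterwards is actually a fairly expensive operation.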

The approach is as follows:

library(dplyr)
library(tidyr)

Data %>%
  gather(Var, Val, TitlesFormat1:TitlesFormat3) %>%
  group_by(Time, Values, Locations) %>%
  select(-Var)
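In current tidyr releases, gather() is superseded by pivot_longer(). The sketch below shows what the same reshaping might look like with it; this is a swapped-in function, not the code from this answer, and the toy data frame `wide` is an assumption for illustration. One difference worth noting is that pivot_longer() emits the output record by record, so the three title variants of each row come out adjacent without a separate ordering step.

```r
library(dplyr)
library(tidyr)

# Toy input (hypothetical): two records, built like the question's data
wide <- data.frame(
  Time          = c(1, 2),
  Values        = c(10, 20),
  Locations     = c("PlaceA", "PlaceB"),
  TitlesFormat1 = c("TitleA", "TitleB"),
  stringsAsFactors = FALSE
)
wide$TitlesFormat2 <- paste0(wide$TitlesFormat1, "_(topic)")
wide$TitlesFormat3 <- paste0(wide$TitlesFormat1, "_(", wide$Locations, "_topic)")

long <- wide %>%
  pivot_longer(
    cols      = starts_with("TitlesFormat"),
    names_to  = "Var",
    values_to = "Val"
  ) %>%
  select(-Var)  # drop the column naming the source format

print(long)
```

Whether this is faster or slower than gather() on the large data set below has not been measured here.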

To test on a larger set, here is some sample data:

set.seed(1)
n <- 1000000
Data <- data.table(
  Time = sample(n),
  Values = sample(n),
  Locations = sample(c("PlaceA","PlaceB","PlaceC"), n, TRUE),
  TitlesFormat1 = sample(c("TitleA", "TitleB","TitleC"), n, TRUE),
  key = "TitlesFormat1,Time"
)

Data$TitlesFormat2 <- paste0(Data$TitlesFormat1, "_(topic)")

Data$TitlesFormat3 <- paste0(Data$TitlesFormat1,
                             "_(",Data$Locations,"_topic)")

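Thank you very much for your suggestions! I'm new to this, so your answer really helps! I will definitely keep the two points you raised in mind! :)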

The functions compared in the benchmark:

fun1a <- function() {
  Data %>%
    gather(Var, Val, TitlesFormat1:TitlesFormat3) %>%
    select(-Var)
}

fun1b <- function() {
  Data %>%
    gather(Var, Val, TitlesFormat1:TitlesFormat3) %>%
    group_by(Time, Values, Locations) %>%
    select(-Var)
} 

fun2a <- function() {
  melt(Data, id=1:3)[, variable := NULL]
}

fun2b <- function() {
  melt(Data, id=1:3)[, variable := NULL][, .SD, by="Time,Values,Locations"]
}
library(microbenchmark)
microbenchmark(fun1a(), fun2a(), fun1b(), fun2b(), times = 50)
# Unit: milliseconds
#     expr        min         lq     median        uq      max neval
#  fun1a()  116.08640  174.88565  321.20920  406.0018  475.027    50
#  fun2a()   85.71839   87.13557   97.65836  163.5093  423.566    50
#  fun1b()  856.71950 1049.25575 1107.36416 1227.6997 1406.043    50
#  fun2b() 1159.17395 1322.75210 1392.12119 1434.5502 1543.636    50