Repeat each row three times and replace the title according to predefined formats in R
I don't think I explained this well in the title, so here is an example. I have 10K+ records like this:
Data <- data.table(
Time= sample(1:50),
Values = sample(1:50),
Locations= sample(c("PlaceA","PlaceB","PlaceC"),50 , replace= TRUE),
TitlesFormat1= sample(c("TitleA", "TitleB","TitleC"), 50, replace = TRUE),
key=c("TitlesFormat1,Time")
)
Data$TitlesFormat2<-paste0(Data$TitlesFormat1,"_(topic)")
Data$TitlesFormat3<-paste0(Data$TitlesFormat1,"_(",Data$Locations,"_topic)")
head(Data)
Time Values Locations TitlesFormat1 TitlesFormat2 TitlesFormat3
2 49 PlaceC TitleA TitleA_(topic) TitleA_(PlaceC_topic)
6 41 PlaceA TitleA TitleA_(topic) TitleA_(PlaceA_topic)
8 40 PlaceA TitleA TitleA_(topic) TitleA_(PlaceA_topic)
13 15 PlaceB TitleA TitleA_(topic) TitleA_(PlaceB_topic)
14 11 PlaceC TitleA TitleA_(topic) TitleA_(PlaceC_topic)
18 17 PlaceC TitleA TitleA_(topic) TitleA_(PlaceC_topic)
Any suggestions?
Thanks in advance for your help.

Before we get to the answer, a few points:
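- If you use sample, call set.seed(.) first to ensure reproducibility.
- The idiomatic way in data.table is to use the := operator to add/update columns by reference. Otherwise there is no advantage to using data.tables.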
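The two points above can be sketched as follows. This is a minimal illustration rather than the asker's actual data: the seed value 42, the table name DT, and the six-row size are all assumptions chosen for the example.

```r
library(data.table)

# Fixing the seed makes the random sample identical on every run.
# (42 is an arbitrary, illustrative seed.)
set.seed(42)
DT <- data.table(
  Locations     = sample(c("PlaceA", "PlaceB", "PlaceC"), 6, replace = TRUE),
  TitlesFormat1 = sample(c("TitleA", "TitleB", "TitleC"), 6, replace = TRUE)
)

# `:=` adds both derived columns by reference, without copying the table,
# instead of assigning with Data$col <- ...
DT[, `:=`(
  TitlesFormat2 = paste0(TitlesFormat1, "_(topic)"),
  TitlesFormat3 = paste0(TitlesFormat1, "_(", Locations, "_topic)")
)]
```

With the seed set, anyone rerunning the snippet gets the same table, which makes outputs in a question directly comparable.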
You can use the melt.data.table function to convert the data.table to long format:
require(reshape2)
ans <- melt(Data, id=1:3)[, variable := NULL]
# Time Values Locations value
# 1: 3 44 PlaceA TitleA
# 2: 7 15 PlaceC TitleA
# 3: 12 3 PlaceC TitleA
# 4: 13 7 PlaceA TitleA
# 5: 15 13 PlaceC TitleA
# ---
# 146: 43 36 PlaceB TitleC_(PlaceB_topic)
# 147: 44 46 PlaceB TitleC_(PlaceB_topic)
# 148: 46 6 PlaceC TitleC_(PlaceC_topic)
# 149: 48 29 PlaceC TitleC_(PlaceC_topic)
# 150: 50 11 PlaceB TitleC_(PlaceB_topic)
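To make the reshaping easier to follow, here is a small sketch on a two-row table. The table name small and its values are made up for illustration, and the sketch assumes a reasonably recent data.table in which melt() is available without loading reshape2.

```r
library(data.table)

# Hypothetical two-row table with the same column layout as Data.
small <- data.table(
  Time          = 1:2,
  Values        = c(10L, 20L),
  Locations     = c("PlaceA", "PlaceB"),
  TitlesFormat1 = c("TitleA", "TitleB")
)
small[, TitlesFormat2 := paste0(TitlesFormat1, "_(topic)")]
small[, TitlesFormat3 := paste0(TitlesFormat1, "_(", Locations, "_topic)")]

# Stack the three title columns into one "value" column; the id columns
# are repeated once per title column, so 2 rows become 6.
long <- melt(small, id.vars = c("Time", "Values", "Locations"))

# Drop the column that records which title format each value came from.
long[, variable := NULL]
```

Note that melt() stacks column by column: all TitlesFormat1 values come first, then TitlesFormat2, then TitlesFormat3. That is why a separate reordering step is needed to bring the three copies of each original row together.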
Following @Ananda's benchmark, I realized we can simply use order() here instead of .SD:
ans[order(Time, Values, Locations)]
order() inside a data.table is optimized to use data.table's fast ordering (from v1.9.3+), so this should be much faster than the earlier .SD version.
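As a rough sketch of the reordering step, the snippet below sorts a tiny melted table so that the copies of each original row end up adjacent. The table long and its values are hypothetical. setorder() is shown as an optional variant that reorders by reference; it is not part of the answer above.

```r
library(data.table)

# Hypothetical melted result: two original rows, two title values each,
# stacked column by column as melt() would produce them.
long <- data.table(
  Time      = c(2L, 1L, 2L, 1L),
  Values    = c(20L, 10L, 20L, 10L),
  Locations = c("PlaceB", "PlaceA", "PlaceB", "PlaceA"),
  value     = c("b1", "a1", "b2", "a2")
)

# order() in i returns a new, sorted data.table; ties keep their
# original relative order, so "a1" stays ahead of "a2".
sorted <- long[order(Time, Values, Locations)]

# Variant: setorder() sorts in place. It is run on a copy here so that
# `long` itself is left untouched.
sorted_by_ref <- setorder(copy(long), Time, Values, Locations)
```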
Here are the updated timings:
# Unit: milliseconds
# expr min lq median uq max neval
# fun1a() 137.79719 154.68321 210.5660 242.4496 565.8980 50
# fun2a() 92.80878 96.90226 139.4311 166.3089 472.6021 50
# fun1b() 750.38312 828.79247 855.2852 940.3480 1151.7485 50
# fun2b() 1059.37594 1238.60744 1332.6860 1417.6680 1502.5817 50
# fun2c() 474.23736 543.14490 580.7551 623.6124 819.4660 50
where fun2c() is:
fun2c <- function() {
melt(Data, id=1:3)[, variable := NULL][order(Time,Values,Locations)]
}
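Here is another alternative using "dplyr" + "tidyr". It is offered mostly for variety, but it also performs quite well. The benchmarks also show that rearranging the row order afterward is actually a fairly expensive operation.
The approach: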
library(dplyr)
library(tidyr)
Data %>%
gather(Var, Val, TitlesFormat1:TitlesFormat3) %>%
group_by(Time, Values, Locations) %>%
select(-Var)
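For readers on newer versions of tidyr (1.0.0 and later), gather() has been superseded by pivot_longer(). The following is a sketch of an equivalent call on a made-up two-row data frame, not the code that was benchmarked in this answer. The names small and long are illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row data frame with the same column layout as Data.
small <- data.frame(
  Time          = 1:2,
  Values        = c(10L, 20L),
  Locations     = c("PlaceA", "PlaceB"),
  TitlesFormat1 = c("TitleA", "TitleB"),
  TitlesFormat2 = c("TitleA_(topic)", "TitleB_(topic)"),
  TitlesFormat3 = c("TitleA_(PlaceA_topic)", "TitleB_(PlaceB_topic)"),
  stringsAsFactors = FALSE
)

# pivot_longer() stacks the three title columns into "Val"; "Var" holds
# the source column name and is dropped afterward.
long <- small %>%
  pivot_longer(TitlesFormat1:TitlesFormat3,
               names_to = "Var", values_to = "Val") %>%
  select(-Var)
```

Unlike gather(), pivot_longer() emits its output row by row, so the three copies of each original row are already adjacent without an extra grouping or sorting step. Its timing relative to the functions benchmarked here was not measured.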
To test on a larger dataset, here is some sample data:
set.seed(1)
n <- 1000000
Data <- data.table(
Time = sample(n),
Values = sample(n),
Locations = sample(c("PlaceA","PlaceB","PlaceC"), n, TRUE),
TitlesFormat1 = sample(c("TitleA", "TitleB","TitleC"), n, TRUE),
key = "TitlesFormat1,Time"
)
Data$TitlesFormat2 <- paste0(Data$TitlesFormat1, "_(topic)")
Data$TitlesFormat3 <- paste0(Data$TitlesFormat1,
"_(",Data$Locations,"_topic)")
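Thank you very much for your suggestions! I'm new to this, so your answer really helps. I'll definitely keep in mind the two points you raised! :)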
fun1a <- function() {
Data %>%
gather(Var, Val, TitlesFormat1:TitlesFormat3) %>%
select(-Var)
}
fun1b <- function() {
Data %>%
gather(Var, Val, TitlesFormat1:TitlesFormat3) %>%
group_by(Time, Values, Locations) %>%
select(-Var)
}
fun2a <- function() {
melt(Data, id=1:3)[, variable := NULL]
}
fun2b <- function() {
melt(Data, id=1:3)[, variable := NULL][, .SD, by="Time,Values,Locations"]
}
library(microbenchmark)
microbenchmark(fun1a(), fun2a(), fun1b(), fun2b(), times = 50)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1a() 116.08640 174.88565 321.20920 406.0018 475.027 50
# fun2a() 85.71839 87.13557 97.65836 163.5093 423.566 50
# fun1b() 856.71950 1049.25575 1107.36416 1227.6997 1406.043 50
# fun2b() 1159.17395 1322.75210 1392.12119 1434.5502 1543.636 50