R: fastest way to reshape variable values as columns
I have a dataset with about 3 million rows and the following structure:
PatientID| Year | PrimaryConditionGroup
---------------------------------------
1 | Y1 | TRAUMA
1 | Y1 | PREGNANCY
2 | Y2 | SEIZURE
3 | Y1 | TRAUMA
As a newcomer to R, I'm struggling to find the right way to reshape the data into the structure outlined below:
PatientID| Year | TRAUMA | PREGNANCY | SEIZURE
----------------------------------------------
1 | Y1 | 1 | 1 | 0
2 | Y2 | 0 | 0 | 1
3 | Y1 | 1 | 0 | 0
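My question is: what is the fastest/most elegant way to create a data.frame in which the values of PrimaryConditionGroup become columns, grouped by PatientID and Year (counting the number of occurrences)?

There are probably more succinct ways of doing this, but for sheer speed it's hard to beat a data.table-based solution: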
df <- read.table(text="PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header=T)
library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))
dt[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
# PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,] 1 Y1 1 1 0
# [2,] 2 Y2 0 0 1
# [3,] 3 Y1 1 0 0
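The call above names each condition by hand. As a minimal sketch of the same counting logic in base R, assuming the condition names should instead be discovered from the data, one indicator column can be built per distinct value and then summed per PatientID and Year. The names conds, ind and wide are illustrative only, and since base aggregate is reported further down as slow on 3 million rows, this is meant to show the logic rather than to compete on speed:

```r
# Toy data from the question
df <- data.frame(
  PatientID = c(1, 1, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1"),
  PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"),
  stringsAsFactors = FALSE
)

# Distinct conditions found in the data (illustrative name)
conds <- unique(df$PrimaryConditionGroup)

# One 0/1 indicator column per condition; sapply() names the columns
# after the elements of conds
ind <- sapply(conds, function(cn) as.integer(df$PrimaryConditionGroup == cn))

# Sum the indicators within each PatientID/Year group
wide <- aggregate(ind, by = df[c("PatientID", "Year")], FUN = sum)
wide
```

The row order returned by aggregate() follows its own sorting of the grouping columns, so it may differ from the data.table output.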
EDIT 2: Finally, a succinct solution using the reshape package gets you to the same place:
library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(mdf, PatientID + Year ~ value, fun.aggregate=length)
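If only the counts matter and a contingency table is acceptable in place of a data.frame, base R's table() may be shorter still. A rough sketch, assuming the toy data from the question; grp and counts are illustrative names, not part of the original answer:

```r
# Toy data from the question
df <- data.frame(
  PatientID = c(1, 1, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1"),
  PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"),
  stringsAsFactors = FALSE
)

# Combine PatientID and Year into a single grouping factor ("1.Y1", "2.Y2", ...),
# dropping combinations that never occur
grp <- interaction(df$PatientID, df$Year, drop = TRUE)

# Cross-tabulate the groups against the condition to get occurrence counts
counts <- table(grp, df$PrimaryConditionGroup)
counts
```

The result is a table object whose row names encode both keys, so it would still need splitting back into PatientID and Year columns if a data.frame is required.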
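
data.table versions >= 1.9.0 include fast melt and dcast methods specific to data.table, implemented in C. Here is a comparison with the other excellent answers from @Josh's post on 3 million rows of data (excluding base:::aggregate, as it took quite some time).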
For more information, see the corresponding NEWS entry for data.table.
I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.
require(data.table) ## >= 1.9.0
require(reshape2)
set.seed(1L)
patients = 1000L
year = 5L
n = 3e6L
condn = c("TRAUMA", "PREGNANCY", "SEIZURE")
# dummy data
DT <- data.table(PatientID = sample(patients, n, TRUE),
                 Year = sample(year, n, TRUE),
                 PrimaryConditionGroup = sample(condn, n, TRUE))

DT_dcast <- function(DT) {
    dcast.data.table(DT, PatientID ~ Year, fun.aggregate=length)
}

reshape2_dcast <- function(DT) {
    reshape2:::dcast(DT, PatientID ~ Year, fun.aggregate=length)
}

DT_raw <- function(DT) {
    DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
               PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
               SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
        by = list(PatientID, Year)]
}
# system.time(.) timed 3 times
# Method Time_rep1 Time_rep2 Time_rep3
# dcast_DT 0.393 0.399 0.396
# reshape2_DT 3.784 3.457 3.605
# DT_raw 0.647 0.680 0.657
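
The timings above come from running system.time three times per method. A minimal sketch of such a harness in base R, assuming a hypothetical helper time_reps that is not part of the original answer:

```r
# Hypothetical helper: run f(...) `reps` times and collect elapsed seconds
time_reps <- function(f, ..., reps = 3L) {
  elapsed <- numeric(reps)
  for (i in seq_len(reps)) {
    elapsed[i] <- system.time(f(...))[["elapsed"]]
  }
  elapsed
}

# Stand-in workload so the sketch runs without extra packages;
# with the functions defined above one would call e.g. time_reps(DT_raw, DT)
x <- runif(1e5)
elapsed <- time_reps(sort, x)
elapsed
```

Absolute numbers depend on the machine and package versions, so only the relative ordering of the methods is likely to carry over.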
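
+1. ddply wouldn't be much less typing, really, and it would certainly be much slower. Why would you even consider ddply for this problem?
Hi Josh, thanks, that works great, as expected. What would be the most succinct/idiomatic way to reshape the data (if performance isn't an issue)?
Hi Matt, I just found another solution and added it to the post. Does that look more succinct/idiomatic?
Is there a way to do this in an MS SQL table?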