Fastest way to reshape variable values as columns in R


I have a dataset with about 3 million rows and the following structure:

PatientID| Year | PrimaryConditionGroup
---------------------------------------
1        | Y1   | TRAUMA
1        | Y1   | PREGNANCY
2        | Y2   | SEIZURE
3        | Y1   | TRAUMA

Being new to R, I have trouble finding the right way to reshape the data into the structure outlined below:

PatientID| Year | TRAUMA | PREGNANCY | SEIZURE
----------------------------------------------
1        | Y1   | 1      | 1         | 0
2        | Y2   | 0      | 0         | 1
3        | Y1   | 1      | 0         | 0

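My question is: what is the fastest/most elegant way to create a data.frame in which the values of PrimaryConditionGroup become columns, grouped by PatientID and Year (counting the number of occurrences)?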

There may be more succinct ways, but for sheer speed it is hard to beat a data.table-based solution:
df <- read.table(text="PatientID Year  PrimaryConditionGroup
1         Y1    TRAUMA
1         Y1    PREGNANCY
2         Y2    SEIZURE
3         Y1    TRAUMA", header=T)

library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))

dt[ , list(TRAUMA =    sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE =   sum(PrimaryConditionGroup=="SEIZURE")),
   by = list(PatientID, Year)]

#      PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,]         1   Y1      1         1       0
# [2,]         2   Y2      0         0       1
# [3,]         3   Y1      1         0       0
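
The call above names each condition by hand. As a rough sketch of the same per-group count without listing the values, data.table's own dcast can spread PrimaryConditionGroup into columns. This assumes a reasonably recent data.table in which dcast dispatches on data.table objects (older releases spell it dcast.data.table); the name wide is only illustrative.

```r
library(data.table)

df <- read.table(text = "PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header = TRUE)

dt <- as.data.table(df)

# One output column per distinct condition. length() counts the rows that fall
# into each PatientID/Year/condition cell; empty cells are filled with 0.
wide <- dcast(dt, PatientID + Year ~ PrimaryConditionGroup,
              fun.aggregate = length,
              value.var = "PrimaryConditionGroup")
print(wide)
```

The condition columns come out in sorted order of their names rather than in the order written in the hand-coded version.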

Second edit: finally, a succinct solution using the reshape package gets you to the same place:
library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
# cast the melted data (mdf), counting rows per PatientID/Year/condition
cast(mdf, PatientID + Year ~ value, fun.aggregate=length)
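
For readers on reshape2, the successor package used in the timings further down, a similar result can probably be had without the melt step by casting directly on the condition column. This is a sketch under that assumption, not part of the original answer; wide_df is an illustrative name.

```r
library(reshape2)

df <- read.table(text = "PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header = TRUE)

# value.var names the column handed to fun.aggregate; with length() the
# values themselves do not matter, only how many rows land in each cell.
wide_df <- reshape2::dcast(df, PatientID + Year ~ PrimaryConditionGroup,
                           fun.aggregate = length,
                           value.var = "PrimaryConditionGroup")
print(wide_df)
```

Calling the function as reshape2::dcast avoids picking up data.table's dcast by accident when both packages are attached.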

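data.table has fast melt and dcast methods. In versions >= 1.9.0, these data.table-specific methods are implemented in C. Below is a comparison with the other excellent answers in @Josh's post on 3 million rows of data (excluding base::aggregate, because it took quite a long time).

For more information, see the NEWS entry.

I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.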

require(data.table) ## >= 1.9.0
require(reshape2)

set.seed(1L)
patients = 1000L
year = 5L
n = 3e6L
condn = c("TRAUMA", "PREGNANCY", "SEIZURE")

# dummy data
DT <- data.table(PatientID = sample(patients, n, TRUE),
                 Year = sample(year, n, TRUE), 
                 PrimaryConditionGroup = sample(condn, n, TRUE))

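# Note: this formula spreads Year across the columns (row counts per patient and year).
# For the layout asked for in the question, cast with
# PatientID + Year ~ PrimaryConditionGroup instead.
DT_dcast <- function(DT) {
    dcast.data.table(DT, PatientID ~ Year, fun.aggregate=length)
}

reshape2_dcast <- function(DT) {
    # dcast is exported by reshape2, so :: is sufficient
    reshape2::dcast(DT, PatientID ~ Year, fun.aggregate=length)
}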

DT_raw <- function(DT) {
    DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
            PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
              SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
}

# system.time(.) timed 3 times
#         Method Time_rep1 Time_rep2 Time_rep3
#       dcast_DT     0.393     0.399     0.396
#    reshape2_DT     3.784     3.457     3.605
#         DT_raw     0.647     0.680     0.657
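
The timings above were taken with system.time(). As a minimal sketch of how such a comparison might be run, the snippet below times the hand-written sums against a dcast on a much smaller table and checks that both give the same counts. The function names count_by_sum and count_by_dcast, the reduced row count, and the PatientID + Year ~ PrimaryConditionGroup formula are assumptions made for illustration, so the figures it prints will not match the table above.

```r
library(data.table)

set.seed(1L)
n <- 1e5L  # far fewer rows than the 3e6 timed above, so this runs quickly
conds <- c("TRAUMA", "PREGNANCY", "SEIZURE")
DT <- data.table(PatientID = sample(1000L, n, TRUE),
                 Year = sample(5L, n, TRUE),
                 PrimaryConditionGroup = sample(conds, n, TRUE))

# Hand-written counts, one sum() per condition
count_by_sum <- function(x) {
  x[, list(TRAUMA = sum(PrimaryConditionGroup == "TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
           SEIZURE = sum(PrimaryConditionGroup == "SEIZURE")),
    by = list(PatientID, Year)]
}

# Same counts via dcast, one column per distinct condition
count_by_dcast <- function(x) {
  dcast(x, PatientID + Year ~ PrimaryConditionGroup,
        fun.aggregate = length, value.var = "PrimaryConditionGroup")
}

t_sum <- system.time(res_sum <- count_by_sum(DT))
t_dcast <- system.time(res_dcast <- count_by_dcast(DT))
print(c(sum = t_sum[["elapsed"]], dcast = t_dcast[["elapsed"]]))

# Put both results in the same row order, then compare column by column
setorder(res_sum, PatientID, Year)
setorder(res_dcast, PatientID, Year)
same <- nrow(res_sum) == nrow(res_dcast) &&
  all(sapply(conds, function(cn) all(res_sum[[cn]] == res_dcast[[cn]])))
print(same)
```

Checking that the two results agree before comparing times guards against benchmarking functions that do not compute the same thing.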

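+1. ddply would not be much less typing, really, and it would certainly be much slower. Why would you even consider ddply for this problem?

Hi Josh, thank you, this works as expected. What would be the most succinct/idiomatic way to reshape the data (if performance is not an issue)?

Hi Matt, I just found another solution and added it to the post. Does this look more succinct/idiomatic?

Is there a way to do this on an MS SQL table?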