R: fastest way to reshape variable values into columns

Tags: r, performance, reshape

I have a dataset of roughly 3 million rows with the following structure:

PatientID| Year | PrimaryConditionGroup
---------------------------------------
1        | Y1   | TRAUMA
1        | Y1   | PREGNANCY
2        | Y2   | SEIZURE
3        | Y1   | TRAUMA
As a newcomer to R, I am struggling to find the right way to reshape the data into the structure outlined below:

PatientID| Year | TRAUMA | PREGNANCY | SEIZURE
----------------------------------------------
1        | Y1   | 1      | 1         | 0
2        | Y2   | 0      | 0         | 1
3        | Y1   | 1      | 0         | 0

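My question is: what is the fastest/most elegant way to create a data.frame in which the values of PrimaryConditionGroup become columns, grouped by PatientID and Year (counting the number of occurrences)?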

There may be more succinct ways, but for sheer speed it is hard to beat a data.table-based solution:

df <- read.table(text="PatientID Year  PrimaryConditionGroup
1         Y1    TRAUMA
1         Y1    PREGNANCY
2         Y2    SEIZURE
3         Y1    TRAUMA", header=TRUE)

library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))

dt[ , list(TRAUMA =    sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE =   sum(PrimaryConditionGroup=="SEIZURE")),
   by = list(PatientID, Year)]

#      PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,]         1   Y1      1         1       0
# [2,]         2   Y2      0         0       1
# [3,]         3   Y1      1         0       0
Second edit: finally, a succinct solution using the reshape package gets you to the same place:

library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(mdf, PatientID + Year ~ value, fun.aggregate=length)
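
For readers using reshape2 rather than reshape, the same cast can probably be written without the melt step. This is only a minimal sketch, assuming reshape2 is installed; it reuses the four-row example data from above and names PrimaryConditionGroup as the value column so that length simply counts rows per cell.

```r
library(reshape2)

# Same four-row example as in the question
df <- read.table(text = "PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header = TRUE, stringsAsFactors = FALSE)

# One row per (PatientID, Year), one column per condition;
# length() counts the rows that fall into each cell.
wide <- reshape2::dcast(df, PatientID + Year ~ PrimaryConditionGroup,
                        value.var = "PrimaryConditionGroup",
                        fun.aggregate = length)
wide
```

Combinations that never occur in the data (for example SEIZURE for patient 3) come out as 0, because the fill value defaults to the aggregate of an empty vector.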
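
data.table versions >= 1.9.0 have fast melt and dcast methods specific to data.table, implemented in C. Here is a comparison with the other excellent answers from @Josh's post on 3 million rows of data (excluding base::aggregate, as it was taking quite some time).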

For more information, see the corresponding data.table NEWS entry.

I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.

require(data.table) ## >= 1.9.0
require(reshape2)

set.seed(1L)
patients = 1000L
year = 5L
n = 3e6L
condn = c("TRAUMA", "PREGNANCY", "SEIZURE")

# dummy data
DT <- data.table(PatientID = sample(patients, n, TRUE),
                 Year = sample(year, n, TRUE), 
                 PrimaryConditionGroup = sample(condn, n, TRUE))

DT_dcast <- function(DT) {
    dcast.data.table(DT, PatientID ~ Year, fun.aggregate=length)
}

reshape2_dcast <- function(DT) {
    reshape2::dcast(DT, PatientID ~ Year, fun.aggregate=length)
}

DT_raw <- function(DT) {
    DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
            PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
              SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
}

# system.time(.) timed 3 times
#         Method Time_rep1 Time_rep2 Time_rep3
#       dcast_DT     0.393     0.399     0.396
#    reshape2_DT     3.784     3.457     3.605
#         DT_raw     0.647     0.680     0.657
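
The timing calls themselves are not shown above, so here is a rough sketch of how such a comparison could be run and sanity-checked. Several things are my assumptions rather than part of the benchmark: the smaller n (so it runs quickly), the formula PatientID + Year ~ PrimaryConditionGroup (chosen to match the layout asked for in the question, not the formula inside the benchmarked functions), and a recent data.table where dcast is exported directly (in 1.9.0 the call was dcast.data.table).

```r
library(data.table)

set.seed(1L)
n <- 1e5L  # assumption: far fewer rows than the 3 million above, for a quick run
condn <- c("TRAUMA", "PREGNANCY", "SEIZURE")
DT <- data.table(PatientID = sample(1000L, n, TRUE),
                 Year = sample(5L, n, TRUE),
                 PrimaryConditionGroup = sample(condn, n, TRUE))

# Wide counts via dcast: one row per (PatientID, Year), one column per condition
t_dcast <- system.time(
  wide_dcast <- dcast(DT, PatientID + Year ~ PrimaryConditionGroup,
                      value.var = "PrimaryConditionGroup",
                      fun.aggregate = length)
)[["elapsed"]]

# Same counts via explicit grouped sums
t_raw <- system.time(
  wide_raw <- DT[, list(TRAUMA = sum(PrimaryConditionGroup == "TRAUMA"),
                        PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
                        SEIZURE = sum(PrimaryConditionGroup == "SEIZURE")),
                 by = list(PatientID, Year)]
)[["elapsed"]]

# Sort the grouped result by PatientID and Year so its rows line up with dcast's
setkey(wide_raw, PatientID, Year)

# Check that both approaches give the same counts, column by column
agree <- all(vapply(condn,
                    function(cn) all(wide_dcast[[cn]] == wide_raw[[cn]]),
                    logical(1)))

c(dcast = t_dcast, raw = t_raw)
agree
```

Elapsed times will vary by machine and data.table version, so treat them as relative indications only.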

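+1. ddply wouldn't be much less typing, really, and it would certainly be much slower. Why are you even considering ddply for this problem?

Hi Josh, thank you, this works well, just as expected. What would be the most succinct/idiomatic way to reshape the data (if performance were not an issue)?

Hi Matt, I just found another solution and added it to the post. Does this look more succinct/idiomatic?

Is there a way to do this in an MS SQL table?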