R: Using data.table to find greater or smaller values by grouping variable
My source data contains values for several months, but I only want to compare the values from two months that are specified in advance. This is my input data:
dput(mydf)
structure(list(Month = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L,
2L, 1L, 2L, 1L), .Label = c("Aug", "Oct", "Sep"), class = "factor"),
Pipe = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5), Gp = structure(c(1L,
1L, 2L, 2L, 2L, 3L, 4L, 5L, 5L, 6L, 6L), .Label = c("A",
"B", "C", "D", "E", "F"), class = "factor")), .Names = c("Month",
"Pipe", "Gp"), row.names = c(NA, -11L), class = "data.frame")
Out of these three months, I only want to compare the months specified by the following variables:
This_month_to_compare <- "Oct"
Last_Month_to_compare <- "Aug"
I have manually added the explanation above.
I did try to code this myself. Here is my attempt:
# Keep an untouched copy of the input as a data.table
mydfi <- data.table::as.data.table(mydf)
mydf <- mydfi
# Method 1: convert to wide format
mydf <- data.table::dcast(mydf, Gp ~ Month, value.var = "Pipe")
# Compare the two months
mydf$Growth <- mydf[[This_month_to_compare]] > mydf[[Last_Month_to_compare]]
# Back to long format
Melt_columns <- c("Aug", "Oct", "Sep")
mydf <- data.table::melt(mydf, measure.vars = Melt_columns, variable.name = "Month", value.name = "Pipe")
# Left join onto the original rows, then blank out every month except the current one
mydfo <- mydf[mydfi, on = c("Month", "Gp", "Pipe")]
mydfo[Month != This_month_to_compare, "Growth"] <- NA
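Update: I was able to solve the problem above by adding a left join, and I have updated the code accordingly. However, I am still looking for a solution that does not need the join.
The reason is that my actual dataset is very large and does not allow for joins.
Any help would be greatly appreciated. Thanks in advance.
Is this what you had in mind?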
> library(data.table)
> mydf <- data.table(mydf)
> This_month_to_compare <- "Oct"
> Last_Month_to_compare <- "Aug"
> setkey(mydf, Gp, Month)
>
> # Look up each group's Pipe value for both months, compare them,
> # drop the helper columns, and keep the flag only on the current month
> mydf[
+ , Pipe_this := .SD[Month == This_month_to_compare, Pipe], by = "Gp"][
+ , Pipe_last := .SD[Month == Last_Month_to_compare, Pipe], by = "Gp"][
+ , `:=`(
+ Greater = Pipe_last < Pipe_this, Pipe_last = NULL, Pipe_this = NULL)][
+ Month != This_month_to_compare, Greater := NA]
> mydf
    Month Pipe Gp Greater
 1:   Aug    3  A      NA
 2:   Oct    4  A    TRUE
 3:   Aug    5  B      NA
 4:   Oct    3  B   FALSE
 5:   Sep    2  B      NA
 6:   Aug    1  C      NA
 7:   Oct    3  D      NA
 8:   Aug    4  E      NA
 9:   Oct    3  E   FALSE
10:   Aug    5  F      NA
11:   Oct   NA  F      NA
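The chained call above can probably be condensed into a single grouped assignment that needs neither a join nor helper columns. The following is a minimal sketch, not the answerer's code. It assumes that each Gp has at most one row per month, and it rebuilds the question's data by hand with character columns instead of factors. The local names cur, prev and cmp are made up for illustration.

```r
library(data.table)

# The question's data, retyped by hand (character columns instead of factors)
mydf <- data.table(
  Month = c("Aug", "Oct", "Aug", "Oct", "Sep", "Aug", "Oct", "Oct", "Aug", "Oct", "Aug"),
  Pipe  = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5),
  Gp    = c("A", "A", "B", "B", "B", "C", "D", "E", "E", "F", "F")
)
This_month_to_compare <- "Oct"
Last_Month_to_compare <- "Aug"

# One pass per group: no join, and no helper columns left behind
mydf[, Greater := {
  cur  <- Pipe[Month == This_month_to_compare]
  prev <- Pipe[Month == Last_Month_to_compare]
  # The comparison is only defined when both months occur exactly once
  cmp <- if (length(cur) == 1L && length(prev) == 1L) cur > prev else NA
  # Only the row of the current month carries the result
  fifelse(Month == This_month_to_compare, cmp, NA)
}, by = Gp]

print(mydf)
```

Because := adds the column by reference, the row order of mydf is left as it was and no copy of the table is made, which may matter for a large dataset.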
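You can simplify the code to avoid two of the [.data.table calls above, if desired, and to avoid defining Pipe_this and Pipe_last.
This can be achieved with two joins. The first join filters for the months to be compared and puts them in the required order, so that the comparison can then be done. The second join appends the result to the original data frame.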
library(data.table)
# Last_Month_to_compare, This_month_to_compare
months_to_compare <- c("Aug", "Oct")
mDT <- setDT(mydf)[
# append row id column (to preserve original order)
, rn := .I][
# cross join of groups and months
CJ(Gp = Gp, Month = months_to_compare, unique = TRUE), on = .(Gp, Month)][
# groupwise comparison of the two months
, Greater := Pipe > shift(Pipe), by = Gp][]
# appending result to original data frame by joining with intermediate result
mydf[mDT, on = .(rn), Greater := i.Greater][]
Note that the original row order of mydf is preserved.
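To see what the cross join and shift() each contribute, here is a small sketch on a made-up table (toy, grid and padded are hypothetical names, not part of the answer). It assumes the default behaviour of CJ(), which sorts its inputs. The "Aug" before "Oct" order used in the answer therefore coincides with alphabetical order, and for other month pairs the row order of the intermediate result is worth checking.

```r
library(data.table)

# Hypothetical toy table: group "b" has no "Aug" row
toy <- data.table(
  Gp    = c("a", "a", "b"),
  Month = c("Aug", "Oct", "Oct"),
  Pipe  = c(1, 2, 5)
)

# All combinations of group and month, sorted by Gp and then Month
grid <- CJ(Gp = toy$Gp, Month = c("Aug", "Oct"), unique = TRUE)

# Joining on the grid adds an NA row for the missing (b, Aug) combination,
# so every group has exactly two rows in the order Aug, Oct
padded <- toy[grid, on = .(Gp, Month)]

# shift() lags Pipe by one row within each group,
# so the Oct row is compared against the Aug row
padded[, Greater := Pipe > shift(Pipe), by = Gp]

print(padded)
```

The padding is what makes the lagged comparison safe: without the NA row, a group with only one of the two months would have nothing to compare against in a well-defined position.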
The intermediate result mDT looks as follows:
    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Aug    1  C  6      NA
 6:   Oct   NA  C NA      NA
 7:   Aug   NA  D NA      NA
 8:   Oct    3  D  7      NA
 9:   Aug    4  E  9      NA
10:   Oct    3  E  8   FALSE
11:   Aug    5  F 11      NA
12:   Oct   NA  F 10      NA
Edit: additional explanation
The OP asked for an explanation of the difference between mydf[mDT, on = .(rn)] and mydf[mDT, on = .(rn), Greater := i.Greater][].
In data.table, X[Y, on = ...] is a right outer join. It is equivalent to merge(X, Y, all.y = TRUE), i.e. it returns all rows of Y. Thus,
mydf[mDT, on = .(rn)]
returns
    Month Pipe Gp rn i.Month i.Pipe i.Gp Greater
 1:   Aug    3  A  1     Aug      3    A      NA
 2:   Oct    4  A  2     Oct      4    A    TRUE
 3:   Aug    5  B  3     Aug      5    B      NA
 4:   Oct    3  B  4     Oct      3    B   FALSE
 5:   Aug    1  C  6     Aug      1    C      NA
 6:    NA   NA NA NA     Oct     NA    C      NA
 7:    NA   NA NA NA     Aug     NA    D      NA
 8:   Oct    3  D  7     Oct      3    D      NA
 9:   Aug    4  E  9     Aug      4    E      NA
10:   Oct    3  E  8     Oct      3    E   FALSE
11:   Aug    5  F 11     Aug      5    F      NA
12:   Oct   NA  F 10     Oct     NA    F      NA
The columns prefixed with i. come from mDT. Note that rows 6 and 7 have no matching row in mydf. In addition, the row order is determined by the order in mDT.
If mydf and mDT are swapped,
mDT[mydf, on = .(rn)][]
returns
    Month Pipe Gp rn Greater i.Month i.Pipe i.Gp
 1:   Aug    3  A  1      NA     Aug      3    A
 2:   Oct    4  A  2    TRUE     Oct      4    A
 3:   Aug    5  B  3      NA     Aug      5    B
 4:   Oct    3  B  4   FALSE     Oct      3    B
 5:    NA   NA NA  5      NA     Sep      2    B
 6:   Aug    1  C  6      NA     Aug      1    C
 7:   Oct    3  D  7      NA     Oct      3    D
 8:   Oct    3  E  8   FALSE     Oct      3    E
 9:   Aug    4  E  9      NA     Aug      4    E
10:   Oct   NA  F 10      NA     Oct     NA    F
11:   Aug    5  F 11      NA     Aug      5    F
The columns prefixed with i. now come from mydf. Note that row 5 has no match in mDT. In addition, the row order is determined by mydf.
With the assignment operator :=, X[Y, on = ..., a := b] behaves like a left join that keeps all rows of X in their original order. Thus,
mydf[mDT, on = .(rn), Greater := i.Greater][]
returns
    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Sep    2  B  5      NA
 6:   Aug    1  C  6      NA
 7:   Oct    3  D  7      NA
 8:   Oct    3  E  8   FALSE
 9:   Aug    4  E  9      NA
10:   Oct   NA  F 10      NA
11:   Aug    5  F 11      NA
where Greater is NA for the rows without a match.
@Uwe - Thanks for your help. Could you help me understand the difference between mydf[mDT, on = .(rn)][] and mydf[mDT, on = .(rn), Greater := i.Greater][]? I understand the former: it is a left join on rn. But I am not sure what Greater := i.Greater does. I appreciate your help.
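What Greater := i.Greater does can also be seen on a much smaller case. The sketch below uses two made-up tables (X, Y and the columns id, val and flag are hypothetical) to contrast a plain join, which builds a new table shaped by Y, with an update join, which adds a column to X in place.

```r
library(data.table)

# Hypothetical toy tables: X is the "original", Y holds results for some ids
X <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))
Y <- data.table(id = c(1L, 2L, 4L), flag = c(TRUE, FALSE, TRUE))

# Plain join: a new table with one row per row of Y, in Y's order;
# id 4 has no partner in X, so val is NA there
joined <- X[Y, on = .(id)]

# Update join: X itself gains a column by reference.
# i.flag refers to column 'flag' of Y, the table in the i position;
# rows of X without a match in Y (id 3) get NA
X[Y, on = .(id), flag := i.flag]

print(joined)
print(X)
```

In the update join the number and order of rows in X do not change, and the unmatched row of Y (id 4) is simply ignored.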