R: divide all values of a column by the value of the same column in a different group
I have a data.table as shown below:
library(data.table)
dt <- structure(list(date = structure(c(17956L, 17959L, 17960L,
17962L, 17963L, 17966L, 17967L, 17968L, 17969L, 17970L, 17973L,
17974L, 17975L, 17976L, 17977L, 17980L, 17981L, 17982L, 17983L,
17984L, 17956L, 17959L, 17960L, 17961L, 17962L, 17963L, 17966L,
17967L, 17968L, 17980L, 17981L, 17982L, 17983L, 17984L), class = c("IDate", "Date")),
group = c("A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B"),
value = c(43.7425,
43.9625, 43.8825, 43.125, 43.2275, 44.725, 45.2275, 45.4275,
45.9325, 46.53, 47.005, 46.6325, 47.04, 48.7725, 47.7625, 47.185,
46.6975, 47.1175, 47.18, 47.4875, 12.31, 12.51, 12.7, 12.4, 12.63,
12.93, 13.18, 13.23, 13.35, 14.27, 14.5, 14.25, 13.88, 13.71)),
row.names = c(NA, -34L), class = c("data.table", "data.frame"))
> dt
date group value
 1: 2019-03-01     A 43.7425
 2: 2019-03-04     A 43.9625
 3: 2019-03-05     A 43.8825
 4: 2019-03-07     A 43.1250
 5: 2019-03-08     A 43.2275
 6: 2019-03-11     A 44.7250
 7: 2019-03-12     A 45.2275
 8: 2019-03-13     A 45.4275
 9: 2019-03-14     A 45.9325
10: 2019-03-15     A 46.5300
11: 2019-03-18     A 47.0050
12: 2019-03-19     A 46.6325
13: 2019-03-20     A 47.0400
14: 2019-03-21     A 48.7725
15: 2019-03-22     A 47.7625
16: 2019-03-25     A 47.1850
17: 2019-03-26     A 46.6975
18: 2019-03-27     A 47.1175
19: 2019-03-28     A 47.1800
20: 2019-03-29     A 47.4875
21: 2019-03-01     B 12.3100
22: 2019-03-04     B 12.5100
23: 2019-03-05     B 12.7000
24: 2019-03-06     B 12.4000
25: 2019-03-07     B 12.6300
26: 2019-03-08     B 12.9300
27: 2019-03-11     B 13.1800
28: 2019-03-12     B 13.2300
29: 2019-03-13     B 13.3500
30: 2019-03-25     B 14.2700
31: 2019-03-26     B 14.5000
32: 2019-03-27     B 14.2500
33: 2019-03-28     B 13.8800
34: 2019-03-29     B 13.7100
    date       group   value
Note that group A has no row for the date 2019-03-06. I would like the ratio (the group B value divided by the group A value of the same date) in a new ratio column. I am hoping there is a simple way to solve this with the data.table library.

Here is a concise approach:
library(tidyr)
library(dplyr)
df1 <- dt %>%
    pivot_wider(names_from = group, values_from = value) %>%
    mutate(ratio = B/A)
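For illustration, the pipeline can be run on a small hypothetical excerpt of the question's data (the excerpt below is illustrative, not from the original answer); pivot_wider turns the groups into columns A and B, and mutate adds the ratio:

```r
library(data.table)
library(tidyr)
library(dplyr)

# small hypothetical excerpt standing in for the question's dt
df <- data.table(
  date  = as.IDate(c("2019-03-01", "2019-03-01", "2019-03-06")),
  group = c("A", "B", "B"),
  value = c(43.7425, 12.31, 12.40)
)

df1 <- df %>%
  pivot_wider(names_from = group, values_from = value) %>%
  mutate(ratio = B / A)
df1  # one row per date with columns A, B, ratio; 2019-03-06 has A = NA, so ratio = NA
```

Note that the result is in wide format (one row per date), unlike the long format of the original dt.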
Here is a one-line solution using data.table. It handles the missing values:
setDT(dt)[, ratio := value/value[group=="A"] , date]
date group value ratio
1: 2019-03-01 A 43.7425 1.0000000
2: 2019-03-04 A 43.9625 1.0000000
3: 2019-03-05 A 43.8825 1.0000000
4: 2019-03-07 A 43.1250 1.0000000
5: 2019-03-08 A 43.2275 1.0000000
6: 2019-03-11 A 44.7250 1.0000000
7: 2019-03-12 A 45.2275 1.0000000
8: 2019-03-13 A 45.4275 1.0000000
9: 2019-03-14 A 45.9325 1.0000000
10: 2019-03-15 A 46.5300 1.0000000
11: 2019-03-18 A 47.0050 1.0000000
12: 2019-03-19 A 46.6325 1.0000000
13: 2019-03-20 A 47.0400 1.0000000
14: 2019-03-21 A 48.7725 1.0000000
15: 2019-03-22 A 47.7625 1.0000000
16: 2019-03-25 A 47.1850 1.0000000
17: 2019-03-26 A 46.6975 1.0000000
18: 2019-03-27 A 47.1175 1.0000000
19: 2019-03-28 A 47.1800 1.0000000
20: 2019-03-29 A 47.4875 1.0000000
21: 2019-03-01 B 12.3100 0.2814197
22: 2019-03-04 B 12.5100 0.2845607
23: 2019-03-05 B 12.7000 0.2894092
24: 2019-03-06 B 12.4000 NA
25: 2019-03-07 B 12.6300 0.2928696
26: 2019-03-08 B 12.9300 0.2991151
27: 2019-03-11 B 13.1800 0.2946898
28: 2019-03-12 B 13.2300 0.2925211
29: 2019-03-13 B 13.3500 0.2938749
30: 2019-03-25 B 14.2700 0.3024266
31: 2019-03-26 B 14.5000 0.3105091
32: 2019-03-27 B 14.2500 0.3024354
33: 2019-03-28 B 13.8800 0.2941925
34: 2019-03-29 B 13.7100 0.2887076
date group value ratio
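To see how the grouped one-liner works, here is a minimal sketch with hypothetical numbers: within each date group, value[group == "A"] picks out that date's group-A value, and the division is recycled across all rows of the date.

```r
library(data.table)

# hypothetical two-date example (numbers are illustrative)
toy <- data.table(
  date  = rep(as.IDate(c("2019-03-01", "2019-03-04")), each = 2),
  group = c("A", "B", "A", "B"),
  value = c(40, 10, 50, 25)
)

# for each date, divide every value by that date's group-A value
toy[, ratio := value / value[group == "A"], by = date]
toy  # A rows get ratio 1, B rows get B/A (0.25 and 0.5)
```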
We can also do it this way:
library(data.table)
dt[order(date, group), ratio := value/first(value), date]
My suggestion is to do an update join of dt with the subset dt[group == "A"] on date. This handles the missing values automatically:
dt[dt[group == "A"], on = "date", ratio := x.value / i.value][]
Note that this approach returns NA in row 24, because there is no matching date 2019-03-06 in group A.
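In the update join, the x. and i. prefixes disambiguate the two tables: x.value refers to dt's own column, i.value to the column of the joined subset dt[group == "A"]. A minimal sketch with hypothetical data shows why unmatched dates keep NA:

```r
library(data.table)

# hypothetical data: group A has no row for 2019-03-06
toy <- data.table(
  date  = as.IDate(c("2019-03-01", "2019-03-01", "2019-03-06")),
  group = c("A", "B", "B"),
  value = c(40, 10, 12)
)

# update join: look up the same-date group-A value (i.value) for each row (x.value)
toy[toy[group == "A"], on = "date", ratio := x.value / i.value]
toy  # the 2019-03-06 row is never matched, so its ratio stays NA
```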
Benchmark
As several solutions have been proposed, I was wondering how they differ in execution speed and memory consumption:

"group": grouping by date
"join": the update join suggested in this answer
"reshape", "reshape2": data.table versions of the reshaping approach, considerably extended so that they return, in two variants, the same result as the other answers

For the benchmark, the bench package is used with varying problem sizes. Missing values among the dates are also simulated, so that some dates do not match.

This is done by creating a separate data.table for each of the groups A and B, each with 10% more rows than the given problem size n. From each of the two data.tables, n rows are sampled and combined into the actual benchmark data.table dt0. So dt0 has 2 * n rows.

Also note that each benchmark run starts with a fresh copy of dt0, because some of the approaches modify the input data.

Note the logarithmic time scale.

For small problem sizes, the group approach is the fastest. With increasing problem size it is superseded by the other approaches, in particular by the join approach. For n = 10^7, join is about seven times faster than group. Surprisingly, the reshape approaches come second fastest although the data is reshaped back and forth.

While the group approach is rather slow for large problem sizes, it has the smallest memory footprint (mem_alloc).
Note that all timings are below 20 seconds even for the largest problem size. So any speed differences will probably only matter if the operation is repeated many times.

From the comments:

I am still not sure what your desired output is. What are its dimensions? The ratio looks like the group B value divided by the group A value for the same date.

That is correct. The shape of the result is somewhat different, though. The data.table equivalent would be dcast(dt, date ~ group)[, ratio := B/A]. Unfortunately, that answer does not show its output, which hides the fact that it returns a ratio of 1.0 in row 24 although there is no matching date 2019-03-06 in group A. This is in contrast to the other answers, which return NA here.
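The dcast equivalent mentioned in the comments can be sketched as follows (hypothetical data; note the result stays in wide format, one row per date, which is the shape difference discussed above):

```r
library(data.table)

# hypothetical data: group A has no row for 2019-03-06
toy <- data.table(
  date  = as.IDate(c("2019-03-01", "2019-03-01", "2019-03-06")),
  group = c("A", "B", "B"),
  value = c(40, 10, 12)
)

wide <- dcast(toy, date ~ group, value.var = "value")
wide[, ratio := B / A]
wide  # 2019-03-06 has A = NA, hence ratio = NA
```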
library(bench)
library(ggplot2)
bm <- press(
n = 10^(3:7)
, {
nx <- as.integer(n * 1.1)
dates <- seq(as.IDate("1970-01-01"), by = 1L, length.out = nx)
dtA <- data.table(date = dates, group = "A", value = (1:nx) * pi)
dtB <- data.table(date = dates, group = "B", value = (1:nx) * 2*pi)
set.seed(123)
dt0 <- rbind(dtA[sample(nx, n)], dtB[sample(nx, n)])
setorder(dt0, group, date)
mark(
join = {
dt <- copy(dt0)
dt[dt[group == "A"], on = "date", ratio := x.value / i.value]
},
group = {
dt <- copy(dt0)
dt[, ratio := value/value[group=="A"] , date]
},
reshape = {
dt <- copy(dt0)
dcast(dt, date ~ group)[, c("ratioA", "ratioB") := .(A/A, B/A)][
, melt(.SD, measure.vars = list(value = c("A", "B"), ratio = c("ratioA", "ratioB")),
variable.name = "group")][
!(is.na(value) & is.na(ratio))][
, group := c("A", "B")[group]]
},
reshape2 = {
dt <- copy(dt0)
dcast(dt, date ~ group)[, c("ratioA", "ratioB") := .(rep(1.0, .N), B/A)][
, melt(.SD, measure.vars = patterns(value = "^[AB]", ratio = "^ratio"),
variable.name = "group")][
, group := c("A", "B")[group]][
!is.na(value)]
},
check = function(x,y) all.equal(x, y, check.attributes = FALSE),
min_iterations = 3L
)
}
)
bm[, 1:10]
# A tibble: 20 x 10
expression n min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 join 1000 3.19ms 3.88ms 243. 355.42KB 2.10 116 1 477.13ms
2 group 1000 2.27ms 2.81ms 342. 137.24KB 2.05 167 1 488.5ms
3 reshape 1000 4.74ms 6.26ms 155. 805.56KB 2.12 73 1 472.21ms
4 reshape2 1000 4.68ms 6.18ms 159. 797.66KB 2.09 76 1 479.4ms
5 join 10000 4.86ms 5.96ms 161. 1.9MB 0 81 0 503.38ms
6 group 10000 18.68ms 19.77ms 49.7 930.32KB 4.52 22 2 442.34ms
7 reshape 10000 9.13ms 11.59ms 83.9 4.29MB 0 42 0 500.68ms
8 reshape2 10000 10.58ms 12.83ms 76.7 4.21MB 0 39 0 508.78ms
9 join 100000 23.43ms 28.14ms 35.2 17.41MB 0 18 0 512.06ms
10 group 100000 187.33ms 192.88ms 5.18 9.13MB 2.59 2 1 385.76ms
11 reshape 100000 51.8ms 57.67ms 17.4 39.31MB 2.17 8 1 460.72ms
12 reshape2 100000 50.59ms 56.78ms 17.3 38.55MB 0 9 0 520.46ms
13 join 1000000 183.66ms 184.12ms 5.40 172.53MB 0 3 0 555.4ms
14 group 1000000 1.83s 1.98s 0.486 86.12MB 2.43 3 15 6.18s
15 reshape 1000000 473.52ms 492.22ms 2.05 389.47MB 0 3 0 1.46s
16 reshape2 1000000 498.48ms 505.92ms 1.97 381.84MB 0 3 0 1.52s
17 join 10000000 2.01s 2.44s 0.432 1.68GB 0.576 3 4 6.95s
18 group 10000000 18.41s 18.7s 0.0531 860.1MB 2.76 3 156 56.46s
19 reshape 10000000 6.07s 6.46s 0.142 3.8GB 0.237 3 5 21.1s
20 reshape2 10000000 6.01s 6.02s 0.161 3.73GB 0.322 3 6 18.65s
autoplot(bm)