R: divide all values of a column by values in the same column but from a different group


I have a data.table as shown below:

library(data.table)
dt <- structure(list(date = structure(c(17956L, 17959L, 17960L, 
                                  17962L, 17963L, 17966L, 17967L, 17968L, 17969L, 17970L, 17973L, 
                                  17974L, 17975L, 17976L, 17977L, 17980L, 17981L, 17982L, 17983L, 
                                  17984L, 17956L, 17959L, 17960L, 17961L, 17962L, 17963L, 17966L, 
                                  17967L, 17968L, 17980L, 17981L, 17982L, 17983L, 17984L), class = c("IDate", "Date")), 
               group = c("A", "A", "A", "A", 
                          "A", "A", "A", "A", "A", "A", "A", "A", 
                          "A", "A", "A", "A", "A", "A", "A", "A", 
                          "B", "B", "B", "B", "B", "B", "B", "B", 
                          "B", "B", "B", "B", "B", "B"), 
               value = c(43.7425, 
                         43.9625, 43.8825, 43.125, 43.2275, 44.725, 45.2275, 45.4275, 
                         45.9325, 46.53, 47.005, 46.6325, 47.04, 48.7725, 47.7625, 47.185, 
                         46.6975, 47.1175, 47.18, 47.4875, 12.31, 12.51, 12.7, 12.4, 12.63, 
                         12.93, 13.18, 13.23, 13.35, 14.27, 14.5, 14.25, 13.88, 13.71)), 
          row.names = c(NA, -34L), class = c("data.table", "data.frame"))
> dt
          date group   value
 1: 2019-03-01     A 43.7425
 2: 2019-03-04     A 43.9625
 3: 2019-03-05     A 43.8825
 4: 2019-03-07     A 43.1250
 5: 2019-03-08     A 43.2275
 6: 2019-03-11     A 44.7250
 7: 2019-03-12     A 45.2275
 8: 2019-03-13     A 45.4275
 9: 2019-03-14     A 45.9325
10: 2019-03-15     A 46.5300
11: 2019-03-18     A 47.0050
12: 2019-03-19     A 46.6325
13: 2019-03-20     A 47.0400
14: 2019-03-21     A 48.7725
15: 2019-03-22     A 47.7625
16: 2019-03-25     A 47.1850
17: 2019-03-26     A 46.6975
18: 2019-03-27     A 47.1175
19: 2019-03-28     A 47.1800
20: 2019-03-29     A 47.4875
21: 2019-03-01     B 12.3100
22: 2019-03-04     B 12.5100
23: 2019-03-05     B 12.7000
24: 2019-03-06     B 12.4000
25: 2019-03-07     B 12.6300
26: 2019-03-08     B 12.9300
27: 2019-03-11     B 13.1800
28: 2019-03-12     B 13.2300
29: 2019-03-13     B 13.3500
30: 2019-03-25     B 14.2700
31: 2019-03-26     B 14.5000
32: 2019-03-27     B 14.2500
33: 2019-03-28     B 13.8800
34: 2019-03-29     B 13.7100
Please note:

  • In the ratio column I have shown the calculation only for ease of understanding; I just need the ratio values in a ratio column.
  • The groups are indexed by the date column.
  • Values are missing in both groups. A bit of error handling would be helpful.

  • I am hoping for a simple way to solve this using the data.table library.

    I can offer a concise approach:

    library(tidyr)
    library(dplyr)
    df1 <- dt %>%
      pivot_wider(names_from = group, values_from = value) %>% 
      mutate(ratio = B/A)
    
    
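A minimal sketch (using a tiny stand-in for dt, an assumption for brevity) of what this reshaping approach produces: the result is wide, one row per date with columns A and B, and a date absent from group A naturally yields NA in the ratio:

```r
library(data.table)
library(tidyr)
library(dplyr)

# Tiny stand-in for the question's dt: 2019-03-06 exists only in group "B"
dt <- data.table(date  = as.IDate(c("2019-03-05", "2019-03-05", "2019-03-06")),
                 group = c("A", "B", "B"),
                 value = c(43.8825, 12.7, 12.4))

# pivot_wider gives one row per date with columns A and B;
# dates missing from group A produce NA in A and hence NA in ratio
df1 <- dt %>%
  pivot_wider(names_from = group, values_from = value) %>%
  mutate(ratio = B / A)
```

Note that the shape differs from the long input: to get back one row per group and date, the result would have to be pivoted longer again.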
    Here is a one-line solution using data.table. It handles the missing values:

    setDT(dt)[, ratio := value/value[group=="A"] , date]
              date group   value     ratio
     1: 2019-03-01     A 43.7425 1.0000000
     2: 2019-03-04     A 43.9625 1.0000000
     3: 2019-03-05     A 43.8825 1.0000000
     4: 2019-03-07     A 43.1250 1.0000000
     5: 2019-03-08     A 43.2275 1.0000000
     6: 2019-03-11     A 44.7250 1.0000000
     7: 2019-03-12     A 45.2275 1.0000000
     8: 2019-03-13     A 45.4275 1.0000000
     9: 2019-03-14     A 45.9325 1.0000000
    10: 2019-03-15     A 46.5300 1.0000000
    11: 2019-03-18     A 47.0050 1.0000000
    12: 2019-03-19     A 46.6325 1.0000000
    13: 2019-03-20     A 47.0400 1.0000000
    14: 2019-03-21     A 48.7725 1.0000000
    15: 2019-03-22     A 47.7625 1.0000000
    16: 2019-03-25     A 47.1850 1.0000000
    17: 2019-03-26     A 46.6975 1.0000000
    18: 2019-03-27     A 47.1175 1.0000000
    19: 2019-03-28     A 47.1800 1.0000000
    20: 2019-03-29     A 47.4875 1.0000000
    21: 2019-03-01     B 12.3100 0.2814197
    22: 2019-03-04     B 12.5100 0.2845607
    23: 2019-03-05     B 12.7000 0.2894092
    24: 2019-03-06     B 12.4000        NA
    25: 2019-03-07     B 12.6300 0.2928696
    26: 2019-03-08     B 12.9300 0.2991151
    27: 2019-03-11     B 13.1800 0.2946898
    28: 2019-03-12     B 13.2300 0.2925211
    29: 2019-03-13     B 13.3500 0.2938749
    30: 2019-03-25     B 14.2700 0.3024266
    31: 2019-03-26     B 14.5000 0.3105091
    32: 2019-03-27     B 14.2500 0.3024354
    33: 2019-03-28     B 13.8800 0.2941925
    34: 2019-03-29     B 13.7100 0.2887076
              date group   value     ratio
    
    We could also do:

    library(data.table)
    dt[order(date, group), ratio := value/first(value), date]
    
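A caveat with this variant, shown on a tiny stand-in for dt (my example, not part of the answer above): on a date where group "A" is absent, first(value) falls back to the first row available for that date, so the ratio silently becomes 1 instead of NA.

```r
library(data.table)

# Stand-in data: 2019-03-06 exists only in group "B"
ex <- data.table(date  = as.IDate(c("2019-03-05", "2019-03-05", "2019-03-06")),
                 group = c("A", "B", "B"),
                 value = c(43.8825, 12.7, 12.4))

# first(value) is group A's value only when group A is present on that date;
# otherwise it is the first value of whichever group happens to be there
ex[order(date, group), ratio := value / first(value), by = date]
ex[date == "2019-03-06"]   # ratio is 1 although group A has no value here
```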

    My suggestion is to do an update join of dt with the subset dt[group == "A"] on date. This handles missing values automatically:

    dt[dt[group == "A"], on = "date", ratio := x.value / i.value][]
    
    Note that this approach returns NA in row 24, because there is no matching date 2019-03-06 in group A.
    Benchmark: since several solutions have been proposed, I wondered about the differences in execution speed and memory consumption:

  • group: grouping by date
  • join: the update join as suggested in this answer
  • reshape, reshape2: data.table versions of the reshaping approach, but heavily extended to return the same result as the other answers, in two flavours
  • For benchmarking, the bench package is used with varying problem sizes. Missing values, which cause some dates not to match, are simulated as well.

    This is achieved by creating a separate data.table for each of the groups A and B, each with 10% more rows than the given problem size n. From each of these two data.tables, n rows are sampled and combined into the actual benchmark data.table dt0. So dt0 has 2 * n rows.

    Also note that every benchmark run starts with a fresh copy of dt0, because some of the approaches modify the input data.

    Note the logarithmic time scale.

    For the smaller problem sizes, the group approach is the fastest. With increasing problem size it is surpassed by the other approaches, in particular by the join approach. For n = 10^7, join is about seven times faster than group. Surprisingly, the reshape approaches are second fastest, although the data is reshaped back and forth.

    While the group approach is rather slow for the larger problem sizes, it has the smallest memory footprint (mem_alloc).

    Note that all timings are below 20 seconds, even for the largest problem size. So any speed difference will only matter if the operation is repeated many times.

    I am still unsure what your desired output is. What are its dimensions? ratio looks like the values of group B divided by the group A value on the same date. Is that correct? The shape of the result is somewhat different. The data.table equivalent would be dcast(dt, date ~ group)[, ratio := B/A]. Unfortunately, the answer does not show its output, which hides the fact that it returns a ratio of 1.0 in row 24, although there is no matching date 2019-03-06 in group A. This is in contrast to the other answers here, which return NA, including:
    dt[dt[group == "A"], on = "date", ratio := x.value / i.value][]
    
              date group   value     ratio
     1: 2019-03-01     A 43.7425 1.0000000
     2: 2019-03-04     A 43.9625 1.0000000
     3: 2019-03-05     A 43.8825 1.0000000
     4: 2019-03-07     A 43.1250 1.0000000
     5: 2019-03-08     A 43.2275 1.0000000
     6: 2019-03-11     A 44.7250 1.0000000
     7: 2019-03-12     A 45.2275 1.0000000
     8: 2019-03-13     A 45.4275 1.0000000
     9: 2019-03-14     A 45.9325 1.0000000
    10: 2019-03-15     A 46.5300 1.0000000
    11: 2019-03-18     A 47.0050 1.0000000
    12: 2019-03-19     A 46.6325 1.0000000
    13: 2019-03-20     A 47.0400 1.0000000
    14: 2019-03-21     A 48.7725 1.0000000
    15: 2019-03-22     A 47.7625 1.0000000
    16: 2019-03-25     A 47.1850 1.0000000
    17: 2019-03-26     A 46.6975 1.0000000
    18: 2019-03-27     A 47.1175 1.0000000
    19: 2019-03-28     A 47.1800 1.0000000
    20: 2019-03-29     A 47.4875 1.0000000
    21: 2019-03-01     B 12.3100 0.2814197
    22: 2019-03-04     B 12.5100 0.2845607
    23: 2019-03-05     B 12.7000 0.2894092
    24: 2019-03-06     B 12.4000        NA
    25: 2019-03-07     B 12.6300 0.2928696
    26: 2019-03-08     B 12.9300 0.2991151
    27: 2019-03-11     B 13.1800 0.2946898
    28: 2019-03-12     B 13.2300 0.2925211
    29: 2019-03-13     B 13.3500 0.2938749
    30: 2019-03-25     B 14.2700 0.3024266
    31: 2019-03-26     B 14.5000 0.3105091
    32: 2019-03-27     B 14.2500 0.3024354
    33: 2019-03-28     B 13.8800 0.2941925
    34: 2019-03-29     B 13.7100 0.2887076
              date group   value     ratio
    
    library(bench)
    library(ggplot2)
    bm <- press(
      n = 10^(3:7)
      , {
        nx <- as.integer(n * 1.1)
        dates <- seq(as.IDate("1970-01-01"), by = 1L, length.out = nx)
        dtA <- data.table(date = dates,  group = "A", value = (1:nx) * pi)
        dtB <- data.table(date = dates,  group = "B", value = (1:nx) * 2*pi)
        set.seed(123)
        dt0 <- rbind(dtA[sample(nx, n)], dtB[sample(nx, n)])
        setorder(dt0, group, date)
        mark(
          join = {
            dt <- copy(dt0)
            dt[dt[group == "A"], on = "date", ratio := x.value / i.value]
          }, 
          group = {
            dt <- copy(dt0)
            dt[, ratio := value/value[group=="A"] , date]  
          },
          reshape = {
            dt <- copy(dt0)
            dcast(dt, date ~ group)[, c("ratioA", "ratioB") := .(A/A, B/A)][
              , melt(.SD, measure.vars = list(value = c("A", "B"), ratio = c("ratioA", "ratioB")), 
                     variable.name = "group")][
                       !(is.na(value) & is.na(ratio))][
                         , group := c("A", "B")[group]]
          },
          reshape2 = {
            dt <- copy(dt0)
            dcast(dt, date ~ group)[, c("ratioA", "ratioB") := .(rep(1.0, .N), B/A)][
              , melt(.SD, measure.vars = patterns(value = "^[AB]", ratio = "^ratio"), 
                     variable.name = "group")][
                       , group := c("A", "B")[group]][
                         !is.na(value)]
          },
          check = function(x,y) all.equal(x, y, check.attributes = FALSE),
          min_iterations = 3L
        )
      }
    )
    
     bm[, 1:10]
    
    # A tibble: 20 x 10
       expression        n      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
       <bch:expr>    <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
     1 join           1000   3.19ms   3.88ms  243.      355.42KB    2.10    116     1   477.13ms
     2 group          1000   2.27ms   2.81ms  342.      137.24KB    2.05    167     1    488.5ms
     3 reshape        1000   4.74ms   6.26ms  155.      805.56KB    2.12     73     1   472.21ms
     4 reshape2       1000   4.68ms   6.18ms  159.      797.66KB    2.09     76     1    479.4ms
     5 join          10000   4.86ms   5.96ms  161.         1.9MB    0        81     0   503.38ms
     6 group         10000  18.68ms  19.77ms   49.7     930.32KB    4.52     22     2   442.34ms
     7 reshape       10000   9.13ms  11.59ms   83.9       4.29MB    0        42     0   500.68ms
     8 reshape2      10000  10.58ms  12.83ms   76.7       4.21MB    0        39     0   508.78ms
     9 join         100000  23.43ms  28.14ms   35.2      17.41MB    0        18     0   512.06ms
    10 group        100000 187.33ms 192.88ms    5.18      9.13MB    2.59      2     1   385.76ms
    11 reshape      100000   51.8ms  57.67ms   17.4      39.31MB    2.17      8     1   460.72ms
    12 reshape2     100000  50.59ms  56.78ms   17.3      38.55MB    0         9     0   520.46ms
    13 join        1000000 183.66ms 184.12ms    5.40    172.53MB    0         3     0    555.4ms
    14 group       1000000    1.83s    1.98s    0.486    86.12MB    2.43      3    15      6.18s
    15 reshape     1000000 473.52ms 492.22ms    2.05    389.47MB    0         3     0      1.46s
    16 reshape2    1000000 498.48ms 505.92ms    1.97    381.84MB    0         3     0      1.52s
    17 join       10000000    2.01s    2.44s    0.432     1.68GB    0.576     3     4      6.95s
    18 group      10000000   18.41s    18.7s    0.0531   860.1MB    2.76      3   156     56.46s
    19 reshape    10000000    6.07s    6.46s    0.142      3.8GB    0.237     3     5      21.1s
    20 reshape2   10000000    6.01s    6.02s    0.161     3.73GB    0.322     3     6     18.65s
    
    autoplot(bm)