Linear interpolation (approx) by group in a dplyr pipe in R

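I have a question that I find hard to explain with an MRE and in a way that is easy to answer, mainly because I don't fully understand where the problem lies myself. So I apologize in advance for being vague.

I have a tibble with many sample and reference measurements, and I want to do a linear interpolation for each sample. Right now I do this by taking out all the reference measurements, running approx on them, and then patching the result back in. But because I take the data out first, I can't do this nicely within groups in a dplyr pipe. At the moment I use a workaround: I add newly created empty columns to the sample tibble and then fill them in with a for-loop.

So my question is: how can I bring the approx step into the pipe and run it within groups, so that I can do everything group-wise? I have experimented with dplyr::do and came across the vignette on programming with dplyr, but searching mostly turns up broom::augment and lm material, which I think works differently. A thread I found also looked promising.

Someone on IRC suggested using a conditional mutate, but I don't yet fully understand where and how to use it in this context.

I think the problem is that I want to filter out part of the data for the following mutate operations, but those mutate operations depend on the grouped data that I just filtered out, if that makes sense.

Here is an MWE: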
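
A minimal sketch of one way around this, not taken from the thread: rather than filtering rows away, keep each group intact and subset the vectors inside `mutate()` with `isref`. The `toy` tibble and the `target_cycle` column name are assumptions for illustration (timestamps a and b from the MWE below, with r45 left out).

```r
library(dplyr)

# Hypothetical toy data: timestamps "a" and "b" from the MWE, r45 left out.
toy <- tibble(
  timestamp = rep(c("a", "b"), each = 7),
  cycle     = rep(c(1:3, 1:4), times = 2),
  isref     = rep(c(rep(FALSE, 3), rep(TRUE, 4)), times = 2),
  r44       = c(28, 27, 26, 30, 29, 28, 27,
                26, 36, 33, 31, 38, 34, 33)
)

result <- toy %>%
  group_by(timestamp) %>%
  mutate(
    # Interpolate against the reference rows of the current group only;
    # the reference rows themselves get NA.
    target_cycle = if_else(
      isref,
      NA_real_,
      approx(x = r44[isref], y = cycle[isref], xout = r44)$y
    )
  ) %>%
  ungroup()

result
```

Because `r44[isref]` is evaluated per group, no rows need to be removed before the interpolation, and sample values outside the reference range simply come back as `NA`.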

library(tidyverse) # or just dplyr, tibble

# create fake data
data <- data.frame(
  # in reality a dttm with the measurement time
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  # measurement cycle, normally 40 for sample, 41 for reference
  cycle = rep(c(rep(1:3, 2), 4), 3),
  # whether the measurement is a reference or a sample
  isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  # measurement intensity for mass 44
  r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
  # measurement intensity for mass 45, normally also masses up to mass 49
  mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...

# overview plot
data %>%
  ggplot(aes(x = cycle, y = r44, colour = isref)) +
  geom_line() +
  geom_line(aes(y = r45), linetype = 2) +
  geom_point() +
  geom_point(aes(y = r45), shape = 1) +
  facet_grid(~ timestamp)

# what I would like to do
data %>%
  group_by(timestamp) %>%
  do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
    y = data %>% filter(isref) %>% pull(cycle),
    xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
  unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.

# here's my current attempt for one of the timestamps
matchref <- function(dat) {
  # split the data into sample gas and reference gas
  ref <- filter(dat, isref)
  smp <- filter(dat, !isref)

  # calculate the "target cycle", the points at which the reference intensity
  # 44 matches the sample intensity 44 with linear interpolation
  target_cycle <- approx(x = ref$r44,
    y = ref$cycle, xout = smp$r44)

  # append the target cycle to the sample gas
  smp <- smp %>%
    group_by(timestamp) %>%
    mutate(target = target_cycle$y)

  # linearly interpolate each reference gas to the target cycle
  ref <- ref %>%
    group_by(timestamp) %>%
    # this is needed because the reference has one more cycle
    mutate(target = c(target_cycle$y, NA)) %>%
    # filter out all the failed ones (no interpolation possible)
    filter(!is.na(target)) %>%
    # calculate interpolated value based on r44 interpolation (i.e., don't
    # actually interpolate this value but shift it based on the 44
    # interpolation)
    mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
      r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
    select(timestamp, target, r44:r45)

  # add new reference gas intensities to the correct sample gasses by the target cycle
  # (the grouping column is called "timestamp"; there is no "time" column)
  left_join(smp, ref, by = c("timestamp", "target"))
}

matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop

# for which I create a copy of data that has the new columns to be created
mr <- data %>%
  # filter the sample gasses (since we convert ref to sample)
  filter(!isref) %>%
  # add empty new columns
  mutate(target = NA, r44 = NA, r45 = NA)

# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
  mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
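
One possible way to drop the loop, sketched under assumptions rather than taken from the thread: `dplyr::group_modify()` (available in dplyr 0.8.1 and later) calls a function once per group and binds the results. The helper `target_for_group()` and the `toy` data are hypothetical, and only the first `approx()` step of `matchref()` is reproduced here.

```r
library(dplyr)

# Hypothetical toy data: timestamps "a" and "b" from the MWE, r45 left out.
toy <- tibble(
  timestamp = rep(c("a", "b"), each = 7),
  cycle     = rep(c(1:3, 1:4), times = 2),
  isref     = rep(c(rep(FALSE, 3), rep(TRUE, 4)), times = 2),
  r44       = c(28, 27, 26, 30, 29, 28, 27,
                26, 36, 33, 31, 38, 34, 33)
)

# Hypothetical helper: `dat` holds the rows of one timestamp,
# `key` is a one-row tibble with that group's timestamp (unused here).
target_for_group <- function(dat, key) {
  ref <- filter(dat, isref)
  smp <- filter(dat, !isref)
  # cycle at which the reference r44 matches each sample r44
  mutate(smp, target = approx(x = ref$r44, y = ref$cycle, xout = r44)$y)
}

per_group <- toy %>%
  group_by(timestamp) %>%
  group_modify(target_for_group) %>%
  ungroup()

per_group
```

The function returns only the sample rows, so the result has three rows per timestamp, with the grouping column added back by `group_modify()`.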

Here is a way to spread the reference and sample values into new columns. For simplicity, I dropped r45 in this example.

  data %>% 
    select(-r45) %>% 
    mutate(isref = ifelse(isref, "REF", "SAMP")) %>% 
    spread(isref, r44) %>% 
    group_by(timestamp) %>% 
    mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>% 
    ungroup

This gives:

  #    timestamp cycle   REF  SAMP target_cycle
  #    <fct>     <dbl> <dbl> <dbl>        <dbl>
  # 1  a             1    30    28          3  
  # 2  a             2    29    27          4  
  # 3  a             3    28    26         NA  
  # 4  a             4    27    NA         NA  
  # 5  b             1    31    26         NA  
  # 6  b             2    38    36          2.5
  # 7  b             3    34    33          4  
  # 8  b             4    33    NA         NA  
  # 9  c             1    15    31         NA  
  # 10 c             2    19    18          3  
  # 11 c             3    18    16          2.5
  # 12 c             4    17    NA         NA
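
The code behind the wider, seven-column output that follows is not shown. Here is a sketch that would produce the same column layout, assuming tidyr's `gather()`, `unite()` and `spread()`. `set.seed(1)` is an addition for reproducibility, so the r45 values will differ from the printed ones.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: added so the random r45 values are reproducible

data <- data.frame(
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  cycle     = rep(c(rep(1:3, 2), 4), 3),
  isref     = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  r44       = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)
) %>%
  mutate(r45 = r44 + rnorm(21, 20))

wide <- data %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  # long format: one row per (timestamp, cycle, isref, mass)
  gather(key, value, r44:r45) %>%
  # combine mass and REF/SAMP into one key, e.g. "r44_REF"
  unite(key, key, isref) %>%
  # back to wide: one column per mass/type combination
  spread(key, value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle,
                                   xout = r44_SAMP)$y) %>%
  ungroup()

wide
```

In current tidyr, `pivot_wider(names_from = isref, values_from = c(r44, r45))` should give the same columns in one step, though in a different column order.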

Keeping r45 as well, with both r44 and r45 spread by REF/SAMP, gives:

# # A tibble: 12 x 7
#    timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
#    <fct>     <dbl>   <dbl>    <dbl>   <dbl>    <dbl>        <dbl>
# 1  a             1      30       28    49.5     47.2          3  
# 2  a             2      29       27    48.8     48.7          4  
# 3  a             3      28       26    47.2     46.8         NA  
# 4  a             4      27       NA    47.9     NA           NA  
# 5  b             1      31       26    51.4     45.7         NA  
# 6  b             2      38       36    57.5     55.9          2.5
# 7  b             3      34       33    54.3     52.4          4  
# 8  b             4      33       NA    52.0     NA           NA  
# 9  c             1      15       31    36.0     51.7         NA  
# 10 c             2      19       18    39.1     37.9          3  
# 11 c             3      18       16    39.2     35.3          2.5
# 12 c             4      17       NA    39.0     NA           NA

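What should happen when a sample value falls outside the reference range? For example, in timestamp a the reference range is 27 to 30, but r44 has a value of 26. Should it be extrapolated or return `NA`?

I think it should return NA. Otherwise I might use Hmisc::approxExtrap.

This is a great start! However, I now don't know how to get r45 back in. If I don't exclude it, I get alternating values and NAs in REF and SAMP, and I can't spread it by isref again, because isref is gone after the first spread call. Should I use reshape's melt function to spread multiple columns at once, or something like that?

Thank you so much! I was also able to implement the second approx call this way, and it works on my actual data! I also now understand a neat trick for quickly switching back and forth between tidy and wide data.

@Japhir Glad to hear it helped!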
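
On the out-of-range question: base `approx()` returns `NA` outside the reference range by default (`rule = 1`), while `rule = 2` holds the nearest end value constant. Neither extrapolates linearly. A small check using the reference values of timestamp a (the variable names are only illustrative):

```r
ref_r44   <- c(30, 29, 28, 27)  # reference intensities of timestamp "a"
ref_cycle <- 1:4                # corresponding cycles

# inside the reference range (27 to 30): ordinary linear interpolation
inside  <- approx(ref_r44, ref_cycle, xout = 27.5)$y

# below the range with the default rule = 1: NA
outside <- approx(ref_r44, ref_cycle, xout = 26)$y

# below the range with rule = 2: value at the nearest end of the range
clamped <- approx(ref_r44, ref_cycle, xout = 26, rule = 2)$y

c(inside = inside, outside = outside, clamped = clamped)
```

So the `NA` rows in the outputs above are the default behaviour, and a true linear extrapolation would need something else, such as the Hmisc::approxExtrap mentioned in the comments.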
