How to write a function to cut many columns in R based on a control dataset


r dataframe


I have a data frame of chemical exposures that looks like this:

  chem1 chem2 chem3 ... chem524
  .06   6.8    .3        .2
  .7    24.3    NA       .7
  .4    2.9    .03       1.6


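I need to convert the continuous values for each chemical into categories based on the exposure values. The distribution of values is heavily skewed, with many 0 values and a few very high values. The cuts need to be based on the subset of the dataset that contains the controls, which looks similar to the data frame above. The output should look like this: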


      chem1_cut      chem2_cut     chem3_cut ...
      (-inf, 0.1]  (0.1, 12.1]  (0.1, 12.1]       
      (0.1, 12.1]  (12.1, inf]     NA      
      (0.1, 12.1]  (0.1, 12.1]  (-inf, 0.1]

So far I have been using the cut function like this for each chemical:

chem_dat$chem_1 <- cut(chem_dat$chem_1 , breaks=c(-Inf, quantile(control_chem_dat$chem_1 , probs=c( 0.5,0.75), na.rm=TRUE), Inf))

How can I correct this function so that it does what I need? Or is there a better way to accomplish this task?

Thank you for your help.

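Try this base R solution. It assumes that chem_dat and control_chem_dat are separate data frames. In this example I set both to the same values, but you can substitute your own. Hope this helps: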

# Data to be cut
chem_dat <- structure(list(chem1 = c(0.06, 0.7, 0.4), chem2 = c(6.8, 24.3,2.9),
                    chem3 = c(0.3, NA, 0.03), chem524 = c(0.2, 0.7, 1.6)),
               class = "data.frame", row.names = c(NA,-3L))
# Control data that supplies the quantile breaks (same values here, for illustration)
control_chem_dat <- structure(list(chem1 = c(0.06, 0.7, 0.4), chem2 = c(6.8, 24.3,2.9),
                    chem3 = c(0.3, NA, 0.03), chem524 = c(0.2, 0.7, 1.6)),
               class = "data.frame", row.names = c(NA,-3L))
# Function: cut y using the 50% and 75% quantiles of x as breaks
cut_func <- function(x,y)
{
  z <- cut(y,breaks=c(-Inf, quantile(x , probs=c( 0.5,0.75), na.rm=TRUE), Inf))
  return(z)
}
# Apply column by column: x comes from the controls, y from chem_dat
Result <- as.data.frame(mapply(cut_func,control_chem_dat,chem_dat))

        chem1       chem2        chem3     chem524
1  (-Inf,0.4]  (-Inf,6.8] (0.232, Inf]  (-Inf,0.7]
2 (0.55, Inf] (15.6, Inf]         <NA>  (-Inf,0.7]
3  (-Inf,0.4]  (-Inf,6.8] (-Inf,0.165] (1.15, Inf]
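The question asks for `_cut` columns next to the original values, stored as categories. A minimal sketch of one way to get there in base R is shown below. It assumes both data frames share the same column names; `cut_by_control` and `chem_with_cuts` are illustrative names, not part of the answer above. `Map()` is used instead of `mapply()` because it returns a list without simplifying, so each result stays a factor.

```r
# Sketch: cut every column of chem_dat using breaks taken from the controls.
# Assumes both data frames have the same column names (matched by name here).
chem_dat <- data.frame(chem1 = c(0.06, 0.7, 0.4),
                       chem2 = c(6.8, 24.3, 2.9),
                       chem3 = c(0.3, NA, 0.03))
control_chem_dat <- chem_dat  # stand-in controls, as in the answer above

# Hypothetical helper: breaks come from the control column, values from exposure
cut_by_control <- function(control, exposure) {
  breaks <- c(-Inf, quantile(control, probs = c(0.5, 0.75), na.rm = TRUE), Inf)
  cut(exposure, breaks = breaks)
}

cols <- names(chem_dat)
# Map() returns a list and does not simplify, so each element stays a factor
cut_cols <- Map(cut_by_control, control_chem_dat[cols], chem_dat[cols])
names(cut_cols) <- paste0(cols, "_cut")

# Original values and their categories side by side
chem_with_cuts <- cbind(chem_dat, as.data.frame(cut_cols))
```

Indexing both data frames with `cols` means the pairing does not depend on column order, only on the names being present in both.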


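You can use lapply over the column indices and apply the function to each column of chem_dat. The advantage of using indices is that you can index control_chem_dat in the same way, provided the columns are in the same order. This produces a list with one entry per column, which you can then combine into a data.frame (a plain cbind on factors would return a matrix of integer codes, so as.data.frame is used here):

chem_cut_list <- lapply(seq_len(ncol(chem_dat)), function(i) {
  cut(chem_dat[, i], breaks = c(-Inf, quantile(control_chem_dat[, i],
                                               probs = c(0.5, 0.75), na.rm = TRUE), Inf))
})
names(chem_cut_list) <- paste0(names(chem_dat), "_cut")

chem_cut <- as.data.frame(chem_cut_list)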
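One thing to watch for with the skewed data described in the question: if many control values are 0, the 50% and 75% quantiles can be identical, and `cut()` then stops with "'breaks' are not unique". The sketch below shows one possible guard. `safe_breaks` is a hypothetical helper name and the numbers are made up for illustration.

```r
# Sketch: guard against tied quantiles in zero-heavy control data.
# safe_breaks() is a hypothetical helper, not from the original answers.
safe_breaks <- function(control, probs = c(0.5, 0.75)) {
  q <- quantile(control, probs = probs, na.rm = TRUE, names = FALSE)
  # unique() drops repeated cut points so cut() does not fail on duplicates
  unique(c(-Inf, q, Inf))
}

# Made-up, zero-inflated controls: both quantiles are 0 here
control  <- c(0, 0, 0, 0, 0, 0, 0, 0, 0.2, 5)
exposure <- c(0, 0.1, 3, NA)

breaks <- safe_breaks(control)
exposure_cut <- cut(exposure, breaks = breaks)
```

Note that collapsing tied breaks reduces the number of categories for that chemical (two instead of three in this example), so columns may end up with different numbers of levels. Whether that is acceptable depends on the analysis.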


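In purrr there is a family of map2* functions that iterate over two arguments simultaneously. When a data.frame is passed to map*, it iterates over the columns. Let's try it with an example dataset: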

library(purrr)
set.seed(555)

control_chem_dat = data.frame(matrix(runif(10*3,min=0,max=0.5),ncol=3))
colnames(control_chem_dat) = paste0("chem",1:3)

chem_dat = data.frame(matrix(runif(5*3,min=0,max=1),ncol=3))
colnames(chem_dat) = paste0("chem",1:3)

Write a function that performs the task given x and y, as you did:

cut_y_by_x = function(x,y){
   cut(y,c(-Inf, quantile(x , probs=c(0.5,0.75), na.rm=TRUE),+Inf))
}

In base R we would do it like this, so you can see the parallel with purrr:

mapply(cut_y_by_x,control_chem_dat,chem_dat)

Now let's do the same in purrr:

map2_dfc(control_chem_dat,chem_dat,cut_y_by_x)
# A tibble: 5 x 3
  chem1         chem2        chem3       
  <fct>         <fct>        <fct>       
1 (0.453, Inf]  (-Inf,0.27]  (0.432, Inf]
2 (0.403,0.453] (0.351, Inf] (-Inf,0.383]
3 (0.453, Inf]  (0.351, Inf] (0.432, Inf]
4 (0.403,0.453] (0.27,0.351] (0.432, Inf]
5 (0.453, Inf]  (-Inf,0.27]  (-Inf,0.383]
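Since the question is also tagged dplyr, the same idea can be sketched with `mutate()` and `across()`. This is only one possible variant and assumes dplyr 1.0.0 or later, with column names that match between the two data frames. Inside `across()`, `cur_column()` returns the name of the column being processed, which is used here to look up the matching control column. `chem_with_cuts` is an illustrative name.

```r
library(dplyr)

# Same simulated data as in the purrr example above
set.seed(555)
control_chem_dat <- data.frame(matrix(runif(10 * 3, min = 0, max = 0.5), ncol = 3))
colnames(control_chem_dat) <- paste0("chem", 1:3)
chem_dat <- data.frame(matrix(runif(5 * 3, min = 0, max = 1), ncol = 3))
colnames(chem_dat) <- paste0("chem", 1:3)

chem_with_cuts <- chem_dat %>%
  mutate(across(
    everything(),
    # Breaks come from the control column with the same name as the
    # chem_dat column currently being processed
    ~ cut(.x, breaks = c(-Inf,
                         quantile(control_chem_dat[[cur_column()]],
                                  probs = c(0.5, 0.75), na.rm = TRUE),
                         Inf)),
    # Keep the original columns and add chem1_cut, chem2_cut, ...
    .names = "{.col}_cut"
  ))
```

The `.names` argument keeps the original exposure values and appends the factor columns, which matches the layout requested in the question.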


This is a nice solution, but I am not sure whether the breaks should be based on the quantiles of another df, control_chem_dat.

@starja It takes the quantiles from the values of each column. There is no other data frame, only the one you have.

That was my first thought too, but in @Justin Andrew's code he actually uses the quantiles of control_chem_dat for the breaks of the data in chem_dat.

@starja In that case a second data frame should be provided. But I understood that only one data frame had to be used.

chem_dat and control_chem_dat are not the same data frame, right? If that is the case, please provide an example of control_chem_dat.

Yes, control_chem_dat is a different data.frame with exactly the same format. Thanks for the suggestions.