R: Subsetting a data frame multiple times

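I want to split a data frame of 20 variables (continuous and categorical) into two parts, containing 70% and 30% of the rows, and repeat that split 100 times. The iris dataset can serve as an example: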

data(iris)

test.rows <- sample(1:nrow(iris), 105)
iris.70 <- iris[test.rows, ]
iris.30 <- iris[-test.rows, ]
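
The hard-coded 105 above is 70% of the 150 rows in iris. A minimal sketch of the same single split with the size computed from the row count and a seed set so the draw can be repeated; the seed value and the names n_train and train_rows are illustrative and not part of the question:

```r
data(iris)

set.seed(42)                              # arbitrary seed, only so the split can be reproduced
n_train    <- floor(0.7 * nrow(iris))     # 105 for the 150 rows of iris
train_rows <- sample(seq_len(nrow(iris)), n_train)

iris.70 <- iris[train_rows, ]             # the 70% part
iris.30 <- iris[-train_rows, ]            # the remaining 30% part

# Together the two parts cover every row exactly once
stopifnot(nrow(iris.70) + nrow(iris.30) == nrow(iris))
```

Computing the size this way means the same lines should work unchanged on a data frame with a different number of rows.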

First, create 100 samples of row indices:
samples <- list()

for(i in 1:100){
  samples[[i]] <- sample(1:nrow(surveyed100), 246)
}


An example using iris:
samples <- list()

for(i in 1:100){
  samples[[i]] <- sample(1:nrow(iris), 105)
}

head(samples)
[[1]]
  [1]  66 106  39  50  33 123  68  62  65 125  30  25  60  70  49  98 140  44 141  94  18  59 117  32  63 133  16 139  97 145 105  78 112  95
 [35] 128  36  37  64  10 124  40 111  17  29  51  89  99   4 135 103 101  19 115  74  73  91  11  67  84  88   1 114 138  21  77  24  69  13
 [69]  53  58 110 150   9  31 144  54 129  34  35  52 142  14 113 127  27  20  87 134 118  15  72  92  75   8 104  96 136 143   2  41 109  90
[103] 146  26   6

[[2]]
  [1]  78  84  89  75  63  81 119  51 127  20  66 106 140  65 116  72 147 141  61 113 130 136 109  49  57 149  90  56   8  46  82  55  38   4
 [35]  70  94 100 117  95  29  45  13 128  11  83  80  35  41 121  73  39  67  19  98 108 103  42   2  44 132 114 137 118  12 125  24  77  53
 [69]  28 150  92   5  43 112  60 122  15  30 104 102 120  76  47  85  40  79  33 143  48 139 148 124  36  16 138 101 115 107 134 126  74   6
[103]  52  50  10

[[3]]
  [1]  23  67  54 131  84 146  25   7  41 101 138  49  28  95  15   5  57  69 126  60  12  92  35  89  50   1  13  77 140 116 136  17 144  64
 [35]  32 139  76 102  61 130   2  44  75 100  81  31  34  46  72  33  18  79  24 133 124  62   9  88   8  66  74 125  51 127 123  52  90  39
 [69] 120  42  16  83  40 137  47  58  82 135  96  20 119  91  36  48 132  55  93 106 107 109 113  53  19 141 105 128  78 143  29   4  45  37
[103]  73  94  87

[[4]]
  [1] 125  41  37  80 136  50  91  89  44 117 132  82  78 128 146  49  61 105 145  83 111 126 100  94   7 102 112  17 120  60  36 104 123  65
 [35]  48  34  45  73  25  46 110  74  66 137 107 101 106  24  97  18 119  72  33 134  87  35 121  14  88   9  39   8  64 142  10 148  54  99
 [69] 103  95  63  11 133 141  32  96  51  81 140  76 138 127  52  75  55  26 115  19  90  16  21  86  56  22  79  53  31  23  68  13  77  30
[103]  71 116  67

[[5]]
  [1]  83   4  85 133 111  55 145  65  81  50 136  64  13  27   5 117  33  69  40 127  80  61  53 125  77  36 124 140 138  86   7   6  79  29
 [35]  21 115  23  74  93  10 132  51   2  41  49 123  94 142 120  48  19  89  28  91  14 118  43 103  87  58 149  20  56 113  82  62 104  44
 [69]  72  47 119  35 143 116 128  26  75  88   9  60  16 130 114  31   1 147  78  73   3  32  70 146 131 102  15  54 141 129  42 101  17  59
[103]  46 134 110

[[6]]
  [1]  18  20  53 106 142 125 120 109 119 129  84 146  99  51  43  91 141  89 131 124  95 135  81  42  73 112 128 133 108  27  28  47  32  76
 [35] 130 138  70  36  10  90  16  11 137  17  87   5  35  25 123  97  12 115 127  94  34 103   4  54 134  78  68  71 101 126  61  37  33   2
 [69]  88  80 144  82 150   3  21 114  58 110 136  22 105 117  79  64 102  49  98  59 132  39   8 149 121  40  29 104  55  77 147  74  50  56
[103]  48  75  23

You can build a small function to do this, for example:
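foo <- function(dat, train_percent = 0.7) {
  n     <- seq_len(nrow(dat))
  # Draw the training rows at random, without replacement
  train <- sample(n, floor(train_percent * length(n)))
  # All remaining rows form the test set. No sample() here: on a single
  # leftover index k, sample(k) would draw from 1:k instead of returning k.
  test  <- setdiff(n, train)
  list(train = dat[train, ], test = dat[test, ])
}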

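The resulting list (from calling foo 100 times with replicate) has 100 elements; each element is itself a list of two data frames, the first named "train" and the second named "test".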
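
Hey, thanks a lot! I know very little about functions. This seems very useful. Is there a way to find out what happened to the other 30% of the data frame that was not used?

@DiegoGuevaraTorres, both data frames are stored in the list. The first one (70% in this example) is called train, and the 30% one is called test.

Honestly, I would use this solution rather than the simple approach. It is very handy. I have saved it among my own custom functions for later use :)

Thanks, Leo, this seems very handy. Is there a way to use these data frames to run a model? I actually want to get AUC values.

Hey Diego, you can use the data frames in the list just as you would ordinary data frames; you only need to pull them out of the list. In a model you might do: lm(x ~ y, data = output[[1]][[1]]). You could even name the two data frames, as @docendodiscimus did in his answer, and then refer to them with output[[1]]$train or something similar.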
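
The question about AUC in the comments is not answered in the thread. The following is only a sketch of one way it might be done with base R, under several assumptions that are not from the original posts: the binary outcome is_virginica, the single predictor Sepal.Length, the logistic model, and the helper auc (which uses the rank-sum identity) are all illustrative choices. The split helper is repeated so the block runs on its own.

```r
# Split helper in the spirit of foo from the answer (test rows kept in original order)
foo <- function(dat, train_percent = 0.7) {
  n     <- seq_len(nrow(dat))
  train <- sample(n, floor(train_percent * length(n)))
  test  <- setdiff(n, train)
  list(train = dat[train, ], test = dat[test, ])
}

data(iris)
# Hypothetical binary outcome: 1 if the flower is a virginica, 0 otherwise
iris2 <- transform(iris, is_virginica = as.integer(Species == "virginica"))

# AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks
auc <- function(score, label) {
  r     <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)  # arbitrary seed
splits <- replicate(100, foo(iris2), simplify = FALSE)

# Fit on each train part, score the matching test part, keep one AUC per split
aucs <- vapply(splits, function(s) {
  fit  <- glm(is_virginica ~ Sepal.Length, data = s$train, family = binomial)
  pred <- predict(fit, newdata = s$test, type = "response")
  auc(pred, s$test$is_virginica)
}, numeric(1))

summary(aucs)
```

The vector aucs then holds 100 test-set AUC values, one per split, which can be summarised or plotted like any numeric vector.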

Then split iris with each sample, storing the 70% and 30% parts as a two-element list:
output <- lapply(samples, function(x) list(iris[x,], iris[-x,]))

head(output[[1]][[1]])
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
66           6.7         3.1          4.4         1.4 versicolor
106          7.6         3.0          6.6         2.1  virginica
39           4.4         3.0          1.3         0.2     setosa
50           5.0         3.3          1.4         0.2     setosa
33           5.2         4.1          1.5         0.1     setosa
123          7.7         2.8          6.7         2.0  virginica

head(output[[1]][[2]])
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3           4.7         3.2          1.3         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
7           4.6         3.4          1.4         0.3  setosa
12          4.8         3.4          1.6         0.2  setosa
22          5.1         3.7          1.5         0.4  setosa
23          4.6         3.6          1.0         0.2  setosa


nrow(output[[1]][[1]])
[1] 105

nrow(output[[1]][[2]])
[1] 45
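
As one of the comments suggests, naming the two list elements makes later code easier to read. A minimal sketch, with the index list rebuilt so the block runs on its own; the seed value and the name output_named are illustrative:

```r
data(iris)

set.seed(123)  # arbitrary seed
# 100 vectors of 105 row indices each, as in the loop above
samples <- lapply(1:100, function(i) sample(1:nrow(iris), 105))

# Same split as before, but with named elements
output_named <- lapply(samples, function(x) {
  list(train = iris[x, ], test = iris[-x, ])
})

nrow(output_named[[1]]$train)
nrow(output_named[[1]]$test)

# Within each split, no row appears in both parts
overlap <- vapply(output_named, function(s) {
  length(intersect(rownames(s$train), rownames(s$test)))
}, integer(1))
stopifnot(all(overlap == 0L))
```

With names in place, output_named[[i]]$train and output_named[[i]]$test can be used instead of positional indexing.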

With foo defined as above, run it 100 times:
replicate(100, foo(iris), simplify = FALSE)
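
A short sketch of how the result might be stored and inspected. The name splits and the seed are assumptions for illustration, and foo is repeated here (with the test rows kept in their original order) so the block runs on its own:

```r
foo <- function(dat, train_percent = 0.7) {
  n     <- seq_len(nrow(dat))
  train <- sample(n, floor(train_percent * length(n)))
  test  <- setdiff(n, train)
  list(train = dat[train, ], test = dat[test, ])
}

data(iris)
set.seed(2024)  # arbitrary seed so the 100 splits can be reproduced
splits <- replicate(100, foo(iris), simplify = FALSE)

# Row counts of the train and test parts across all splits
train_sizes <- vapply(splits, function(s) nrow(s$train), integer(1))
test_sizes  <- vapply(splits, function(s) nrow(s$test),  integer(1))
table(train_sizes)
table(test_sizes)

# First rows of the training part of the third split
head(splits[[3]]$train)
```

simplify = FALSE keeps the result as a plain list of 100 elements rather than letting replicate try to collapse it into a matrix.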