R: Extracting multiple data sets from the same worksheet
Update: I ended up manually cutting and pasting the data into multiple sheets. It would be great if a proper solution could be found.

Question: Given the following dummy data set:
structure(list(V1 = structure(c(8L, 6L, 2L, 4L, 1L, 1L, 1L, 1L,
9L, 5L, 2L, 1L, 1L, 1L, 1L, 10L, 7L, 3L), .Label = c("", "1",
"12", "5", "Age", "Class A", "Height", "Number of Boys", "More Boys",
"More Girls"), class = "factor"), V2 = structure(c(1L, 5L, 3L,
4L, 1L, 1L, 1L, 1L, 1L, 6L, 3L, 1L, 1L, 1L, 1L, 1L, 7L, 2L), .Label = c("",
"12", "2", "6", "Class B", "Time", "Weight"), class = "factor"),
V3 = structure(c(1L, 5L, 3L, 4L, 1L, 1L, 1L, 1L, 1L, 6L,
3L, 1L, 1L, 1L, 1L, 1L, 7L, 2L), .Label = c("", "13", "3",
"7", "Class C", "Next", "Time"), class = "factor"), V4 = structure(c(1L,
5L, 3L, 4L, 1L, 1L, 1L, 1L, 1L, 6L, 3L, 1L, 1L, 1L, 1L, 1L,
6L, 2L), .Label = c("", "14", "4", "8", "Class D", "Day"), class = "factor"),
V5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), V6 = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), V7 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), V8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), V9 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), V10 = structure(c(5L,
4L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("", "1", "8", "Class E", "Number of Girls"
), class = "factor"), V11 = structure(c(1L, 4L, 3L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"2", "8", "Class F"), class = "factor"), V12 = structure(c(1L,
4L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("", "3", "9", "Class G"), class = "factor"),
V13 = structure(c(1L, 4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "0", "4",
"Class Q"), class = "factor"), V14 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-18L))
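This looks like (truncated):
As we can hopefully see, these are independent "files" on the same sheet. I have been looking for a quick way to extract the different data sets but have not found one yet. Is there a workaround?
My idea was to use sequence-based selection, say selecting every 20 rows, but this would obviously fail with millions of rows, or whenever the tables differ in length.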
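To illustrate why a fixed row step is fragile, here is a minimal sketch that splits on blank rows instead of on a fixed count. The toy `sheet` and the names `blank`, `block_id` and `blocks` are hypothetical, and the sketch only covers tables stacked vertically, not tables placed side by side.

```r
# Toy sheet: two stacked tables separated by blank rows
# (hypothetical data, not the data set from the question).
sheet <- data.frame(V1 = c("A", "1", "2", "", "", "B", "3"),
                    V2 = c("",  "x", "y", "", "", "",  "z"),
                    stringsAsFactors = FALSE)

# A row is blank when every cell is NA or an empty string.
blank <- apply(sheet, 1, function(r) all(is.na(r) | r == ""))

# Start a new block id whenever a blank row is followed by a non-blank row.
block_id <- cumsum(c(TRUE, diff(blank) == -1))

# Drop the blank rows, then split the remaining rows by block.
blocks <- split(sheet[!blank, , drop = FALSE], block_id[!blank])
blocks
```

Because the block boundaries come from the data itself, the tables may have different lengths and the separators may span any number of rows.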
Expected output (with their respective rows)

Thanks in advance.

Using @alexis_laz's solution:
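library(Matrix)

# `x` is the data.frame from the question (the structure(...) call above).
# Reconstructed line (assumption, the original was lost in extraction):
# flag every non-empty cell in a logical sparse matrix.
m = Matrix(!is.na(x) & x != "", sparse = TRUE)
m
#>    [[ suppressing 14 column names 'V1', 'V2', 'V3' ... ]]
#>
#>  [1,] | . . . . . . . . | . . . .
#>  [2,] | | | | . . . . . | | | | .
#>  [3,] | | | | . . . . . | | | | .
#>  [4,] | | | | . . . . . | | | | .
#>  [5,] . . . . . . . . . . . . . .
#>  [6,] . . . . . . . . . . . . . .
#>  [7,] . . . . . . . . . . . . . .
#>  [8,] . . . . . . . . . . . . . .
#>  [9,] | . . . . . . . . . . . . .
#> [10,] | | | | . . . . . . . . . .
#> [11,] | | | | . . . . . . . . . .
#> [12,] . . . . . . . . . . . . . .
#> [13,] . . . . . . . . . . . . . .
#> [14,] . . . . . . . . . . . . . .
#> [15,] . . . . . . . . . . . . . .
#> [16,] | . . . . . . . . . . . . .
#> [17,] | | | | . . . . . . . . . .
#> [18,] | | | | . . . . . . . . . .

# Row/column coordinates of the non-empty cells
sm = as.matrix(summary(m))
# Cells that touch horizontally or vertically are at Manhattan distance 1
d = dist(sm, "manhattan")
# Single linkage cut at height 1 groups cells into connected blocks
gr = cutree(hclust(d, "single"), h = 1)

# Reconstructed line (assumption): put each cell's group id back on the grid.
res = sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)
res
#> 18 x 13 sparse Matrix of class "dgCMatrix"
#>
#>  [1,] 1 . . . . . . . . 4 . . .
#>  [2,] 1 1 1 1 . . . . . 4 4 4 4
#>  [3,] 1 1 1 1 . . . . . 4 4 4 4
#>  [4,] 1 1 1 1 . . . . . 4 4 4 4
#>  [5,] . . . . . . . . . . . . .
#>  [6,] . . . . . . . . . . . . .
#>  [7,] . . . . . . . . . . . . .
#>  [8,] . . . . . . . . . . . . .
#>  [9,] 2 . . . . . . . . . . . .
#> [10,] 2 2 2 2 . . . . . . . . .
#> [11,] 2 2 2 2 . . . . . . . . .
#> [12,] . . . . . . . . . . . . .
#> [13,] . . . . . . . . . . . . .
#> [14,] . . . . . . . . . . . . .
#> [15,] . . . . . . . . . . . . .
#> [16,] 3 . . . . . . . . . . . .
#> [17,] 3 3 3 3 . . . . . . . . .
#> [18,] 3 3 3 3 . . . . . . . . .

# Reconstructed line (assumption): cut each group's rows and columns out of `x`.
res2 = lapply(split(as.data.frame(sm[, c("i", "j")]), gr),
              function(ij) x[unique(ij$i), unique(ij$j)])
res2
#> $`1`
#>               V1      V2      V3      V4
#> 1 Number of Boys
#> 2        Class A Class B Class C Class D
#> 3              1       2       3       4
#> 4              5       6       7       8
#>
#> $`2`
#>           V1   V2   V3  V4
#> 9  More Boys
#> 10       Age Time Next Day
#> 11         1    2    3   4
#>
#> $`3`
#>            V1     V2   V3  V4
#> 16 More Girls
#> 17     Height Weight Time Day
#> 18         12     12   13  14
#>
#> $`4`
#>               V10     V11     V12     V13
#> 1 Number of Girls
#> 2         Class E Class F Class G Class Q
#> 3               8       8       9       0
#> 4               1       2       3       4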
Created on 2019-04-10 by the reprex package (v0.2.1)
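The clustering step in the answer above may be easier to follow on a tiny example. The sketch below uses hypothetical cell coordinates (`cells` is not part of the original answer) and only base R's `dist`, `hclust` and `cutree`. It shows that single-linkage clustering on Manhattan distance, cut at height 1, assigns one group per block of touching cells.

```r
# (row, column) coordinates of the non-empty cells of a hypothetical sheet:
# a 2 x 2 block in the top-left corner and a 1 x 2 block further down and right.
cells <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2),
               c(4, 4), c(4, 5))
colnames(cells) <- c("i", "j")

# Manhattan distance is exactly 1 for horizontally or vertically adjacent cells.
d <- dist(cells, method = "manhattan")

# Single linkage merges clusters through chains of adjacent cells, so cutting
# the tree at height 1 yields one group per connected block.
gr <- cutree(hclust(d, method = "single"), h = 1)
gr
# expected: 1 1 1 1 2 2
```

One consequence of this design is that blocks must be separated by at least one empty row or column; two tables that touch would be merged into a single group. Note also that `dist` stores all pairwise distances, so memory grows quadratically with the number of non-empty cells.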
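The following code creates a list of data.frames with correctly set column names. However, it relies on the fact that in your worksheet the "table columns" are separated by at least one empty column.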
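# Treat empty strings and single spaces as missing.
# apply() also converts df into a character matrix.
df <- apply(df, 2, function(x) gsub("^$|^ $", NA, x))

# Columns that are entirely NA separate the side-by-side tables.
empty_cols <- sapply(1:ncol(df), function(i){length(which(is.na(df[, i])))==nrow(df)})
# A table starts in column 1 and after every run of empty columns.
start_cols <- c(1, which(diff(empty_cols)==-1)+1)
# Column 1 is not a start column if the top-left cell is empty.
if (is.na(df[1, 1])) start_cols <- start_cols[-1]

# For each start column: rows where a table begins (NA followed by non-NA).
start_rows <- lapply(start_cols, function(i){
  start_rows <- c(1, which(diff(is.na(df[, i]))==-1)+1)
  if (is.na(df[1, i])) start_rows <- start_rows[-1]
  start_rows})
# For each start column: rows where a table ends (non-NA followed by NA).
end_rows <- lapply(start_cols, function(i){
  end_rows <- c(1, which(diff(is.na(df[, i]))==1))
  if (!is.na(df[nrow(df), i])) end_rows <- c(end_rows, nrow(df))
  end_rows[-1]})

data.sets <- list()
for (i in 1:length(start_cols)) {
  for (j in 1:length(start_rows[[i]])){
    col <- start_cols[i]
    row <- start_rows[[i]][j]        # row holding the table title
    start_row <- row+1               # row holding the column names
    end_row <- end_rows[[i]][j]
    name <- df[row, col]
    # Table width: number of cells in the header row before the first empty one.
    ncol <- which(diff(is.na(df[row+1, col:ncol(df)]))==1)[1]
    end_col <- col+ncol-1
    column_names <- df[start_row, col:end_col]
    data <- df[(start_row+1):end_row, col:end_col]
    # matrix() keeps single-row tables two-dimensional.
    data <- matrix(data, ncol = length(col:end_col))
    data <- as.data.frame(data)
    names(data) <- column_names
    data.sets[[name]] <- data
  }
}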
> data.sets
$`Number of Boys`
  Class A Class B Class C Class D
1       1       2       3       4
2       5       6       7       8

$`More Boys`
  Age Time Next Day
1   1    2    3   4

$`More Girls`
  Height Weight Time Day
1     12     12   13  14

$`Number of Girls`
  Class E Class F Class G Class Q
1       8       8       9       0
2       1       2       3       4
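Because the sheet is processed as a character matrix, the extracted columns hold text rather than numbers. A possible follow-up, sketched here on a hypothetical stand-in for one element of `data.sets` and assuming the columns are character vectors (as they are on R >= 4.0), is to let `type.convert` from the utils package choose a type per column.

```r
# Stand-in for one element of `data.sets`: every column is character
# (hypothetical values mirroring the output above).
data.sets <- list(
  `Number of Boys` = data.frame(`Class A` = c("1", "5"),
                                `Class B` = c("2", "6"),
                                check.names = FALSE,
                                stringsAsFactors = FALSE)
)

# type.convert() guesses a suitable type for each column; as.is = TRUE keeps
# genuine text columns as character instead of turning them into factors.
typed <- lapply(data.sets, type.convert, as.is = TRUE)
str(typed[["Number of Boys"]])
```

The column names are left untouched, so tables such as `More Boys` keep their headers while their values become integers.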
Three data sets:
A: Number of Boys
B: Number of Girls
C: More Boys