How do I parse a data.frame into a tree?
Tags: r, parsing, tree

The question starts from a simple taxonomy (labels and IDs).

Perhaps not the most efficient approach, but not too hard either. Create the data:
test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), c("Wheat", "Grass", "Other")),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)))
test_data

I would avoid list structures in favor of tidy data. Here is one way to reduce the redundancy in the data:
library(dplyr)

# One parent/child table per pair of adjacent levels;
# rows whose child label is NA are dropped.
h1_h2 =
  test_data %>%
  select(cat_h1, cat_h2) %>%
  distinct() %>%
  filter(!is.na(cat_h2))

h2_h3 =
  test_data %>%
  select(cat_h2, cat_h3) %>%
  distinct() %>%
  filter(!is.na(cat_h3))

h3_h4 =
  test_data %>%
  select(cat_h3, cat_h4) %>%
  distinct() %>%
  filter(!is.na(cat_h4))
The original hierarchy can easily be reassembled by joining the tables on their shared columns:

h1_h2 %>%
  left_join(h2_h3) %>%
  left_join(h3_h4)
Edit: here is a way to automate the whole process:
# Written for older dplyr/lazyeval; data_frame() has since been
# deprecated in favor of tibble().
library(dplyr)
library(lazyeval)

adjacency = function(data) {
  # Build the parent/child table for one pair of adjacent columns
  adjacency_table = function(data, larger_name, smaller_name)
    lazy(data %>%
           select(larger_name, smaller_name) %>%
           distinct %>%
           filter(smaller_name %>% is.na %>% `!`) ) %>%
    interp(larger_name = larger_name %>% as.name,
           smaller_name = smaller_name %>% as.name) %>%
    lazy_eval %>%
    setNames(c("larger", "smaller"))

  # Pair every column name with the one before it, then process each pair
  data_frame(smaller_name = data %>% names) %>%
    mutate(larger_name = smaller_name %>% lag) %>%
    slice(-1) %>%
    group_by(larger_name, smaller_name) %>%
    do(adjacency_table(data, .$larger_name, .$smaller_name) )
}

result =
  test_data %>%
  select(-cat_id) %>%
  adjacency
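For readers without lazyeval, the same idea (one parent/child table per pair of adjacent columns, stacked into a single edge list) can be sketched in base R. This is only an illustrative sketch, not the answerer's code: the helper name adjacency_base is hypothetical, and the output column names simply mirror the ones used above.

```r
# Sample data from the question
test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), "Wheat", "Grass", "Other"),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the distinct (parent, child) pairs of every
# pair of adjacent columns into one data.frame.
adjacency_base <- function(data) {
  cols <- names(data)
  pairs <- lapply(seq_len(length(cols) - 1), function(i) {
    edge <- unique(data[, c(cols[i], cols[i + 1])])
    names(edge) <- c("larger", "smaller")
    # Drop rows that have no child at this level
    edge <- edge[!is.na(edge$smaller), , drop = FALSE]
    data.frame(
      larger_name  = rep(cols[i], nrow(edge)),
      smaller_name = rep(cols[i + 1], nrow(edge)),
      edge,
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pairs)
  rownames(out) <- NULL
  out
}

# Drop the ID column first, as in the answer above
edges <- adjacency_base(test_data[, -1])
print(edges)
```

Because nodes are identified by label only, this sketch shares the limitation of the label-based tables: two nodes with the same label in the same column are not distinguished.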
If you are fine with a slight change in ordering, here is a recursive solution that works column by column:
f <- function(x, d=cbind(x,NA)) {
  c(
    # call f by branch
    if(ncol(d) > 3) local({
      x <- d[!is.na(d[[3]]),]
      by( x[-2], droplevels(x[2]), f, x=NA, simplify=FALSE)
    }),
    # leaf nodes
    setNames(as.list(d[[1]]), d[[2]])[is.na(d[[3]])]
  )
}
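Comments:

But this isn't what the OP asked for at all. I can understand "that's not a great approach, this one is better", but this seems to stray quite far from the topic…

@BenBolker It is technically off-topic, but it actually (by coincidence?) anticipated my pressing need, which is to re-represent the tree in adjacency-list form (as opposed to the original "column lineage" form)! I can see this being generalized, wrapped in `lapply`, and then piped into `bind_rows`. Perhaps it is only one step away from `Reduce`. But (and this is not reflected in the OP) if two or more nodes have the same label but different paths from the root, there may be ambiguity/collision problems.

I have edited in a new automated version. Yes, ambiguity is possible. But if it really is the case that two or more nodes can have the same label with different paths, then there is actually no redundancy in the original table, and it can be left as it is.

I feel there must be a solution using Reduce() or split(), but I just can't figure it out.

@time +1 for the pointer to the "data.tree" package. Thanks! Nice.

The closest I got with similar by/split logic was with(test_data, Map(split, split(cat_id, cat_h1), split(cat_h2, cat_h1))) before it broke down.

Order doesn't matter! Recursion is fine. Thank you very much!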
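The label collision raised in the comments does occur in the sample data ("Other" appears under both Mammals and Birds). One possible way around it, sketched here under the assumption that the full path from the root is an acceptable node key, is to build cumulative paths with Reduce() before extracting edges. The function name path_edges and the "/" separator are illustrative choices, not part of the original thread.

```r
# Sample data from the question
test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), "Wheat", "Grass", "Other"),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: identify every node by its full path from the root,
# so equal labels under different parents stay distinct.
path_edges <- function(data, sep = "/") {
  # Accumulate paths level by level: "Animals", "Animals/Mammals", ...
  paths <- Reduce(
    function(parent, child) {
      ifelse(is.na(parent) | is.na(child),
             NA_character_,
             paste(parent, child, sep = sep))
    },
    data,
    accumulate = TRUE
  )
  # One edge table per pair of adjacent levels, stacked together
  edges <- do.call(rbind, lapply(seq_len(length(paths) - 1), function(i) {
    keep <- !is.na(paths[[i + 1]])
    unique(data.frame(
      parent = paths[[i]][keep],
      child  = paths[[i + 1]][keep],
      stringsAsFactors = FALSE
    ))
  }))
  rownames(edges) <- NULL
  edges
}

edges_by_path <- path_edges(test_data[, -1])
print(edges_by_path)
```

With path keys, each child appears exactly once in the edge list, at the cost of longer identifiers.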
Output of the recursive function f:

> str(f(test_data))
List of 2
 $ Animals:List of 2
  ..$ Birds  :List of 2
  .. ..$ Hawks: num 650
  .. ..$ Other: num 126
  ..$ Mammals:List of 2
  .. ..$ Dogs :List of 2
  .. .. ..$ Big   : num 661
  .. .. ..$ Little: num 197
  .. ..$ Other: num 228
 $ Plants :List of 3
  ..$ Wheat: num 912
  ..$ Grass: num 949
  ..$ Other: num 428
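A nested list of this kind can also be built with split(), which one commenter was hoping for. The following is a rough sketch rather than a drop-in replacement for f: the name build_tree is hypothetical, and it assumes that every row is a leaf whose remaining columns are all NA (a node that is both a leaf and a parent is not handled). Groups come out in alphabetical order, because split() converts the labels to a factor.

```r
# Sample data from the question
test_data <- data.frame(
  cat_id = c(661, 197, 228, 650, 126, 912, 949, 428),
  cat_h1 = c(rep("Animals", 5), rep("Plants", 3)),
  cat_h2 = c(rep("Mammals", 3), rep("Birds", 2), "Wheat", "Grass", "Other"),
  cat_h3 = c("Dogs", "Dogs", "Other", "Hawks", "Other", rep(NA, 3)),
  cat_h4 = c("Big", "Little", rep(NA, 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group rows by the first label column, then recurse
# into the remaining columns until they run out or contain only NA.
build_tree <- function(ids, labels) {
  groups <- split(seq_along(ids), labels[[1]])
  lapply(groups, function(rows) {
    rest <- labels[rows, -1, drop = FALSE]
    if (ncol(rest) == 0 || all(is.na(rest[[1]]))) {
      ids[rows]                      # leaf: return the ID(s)
    } else {
      build_tree(ids[rows], rest)    # branch: descend one level
    }
  })
}

tree <- build_tree(test_data$cat_id, test_data[, -1])

# Nodes can then be reached by label
print(tree$Animals$Mammals$Dogs$Big)
print(tree$Plants$Other)
```

Indexing by label, as in the last two lines, is the main convenience of the nested-list form over the adjacency tables.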