R: sort a data frame column by a vector for each repeated set in the data frame
I have the following data frame:
col1 <- 1:10
col2 <- rep(c("COL","CIP","CHL","GEN","TMP"), 2)
col3 <- rep(c("spec1", "spec2"), each = 5)
df <- data.frame(col1, col2, col3, stringsAsFactors = F)
library(dplyr)
order_vector <- c("CHL","GEN","COL","CIP","TMP")
Sorting col2 by order_vector with slice() and match() returns only the rows of the first set:
df <- df %>%
  slice(match(order_vector, col2))
col1 col2 col3
3 CHL spec1
4 GEN spec1
1 COL spec1
2 CIP spec1
5 TMP spec1
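However, I would like this to apply to every value in col3, preferably using dplyr.

If you convert col2 to a factor with order_vector as its levels, you can sort by it: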
library(dplyr)
df %>% mutate_at("col2",factor,levels=order_vector) %>%
arrange(col3,col2) %>%
mutate_at("col2",as.character) # if you want to go back to characters, but maybe you shouldn't
# col1 col2 col3
# 1 3 CHL spec1
# 2 4 GEN spec1
# 3 1 COL spec1
# 4 2 CIP spec1
# 5 5 TMP spec1
# 6 8 CHL spec2
# 7 9 GEN spec2
# 8 6 COL spec2
# 9 7 CIP spec2
# 10 10 TMP spec2
Or, more simply, inspired by CPak's answer:
df %>% arrange(col3,factor(col2,levels=order_vector))
You can also use the fact that dplyr joins preserve order:
df %>%
right_join(data.frame(col2=order_vector)) %>%
arrange(col3)
# col1 col2 col3
# 1 3 CHL spec1
# 2 4 GEN spec1
# 3 1 COL spec1
# 4 2 CIP spec1
# 5 5 TMP spec1
# 6 8 CHL spec2
# 7 9 GEN spec2
# 8 6 COL spec2
# 9 7 CIP spec2
# 10 10 TMP spec2
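The join idea above depends on the row order that the join returns, and that order can differ between dplyr versions (I believe right_join() stopped following the right-hand table's order in dplyr 1.0.0, but please check on your installation). A minimal sketch that avoids the dependency is to join an explicit rank column and sort on it. The names lookup, rank and sorted are hypothetical and only for illustration:

```r
library(dplyr)

df <- data.frame(
  col1 = 1:10,
  col2 = rep(c("COL", "CIP", "CHL", "GEN", "TMP"), 2),
  col3 = rep(c("spec1", "spec2"), each = 5),
  stringsAsFactors = FALSE
)
order_vector <- c("CHL", "GEN", "COL", "CIP", "TMP")

# Hypothetical lookup table: one row per value of col2 with its position in order_vector
lookup <- data.frame(
  col2 = order_vector,
  rank = seq_along(order_vector),
  stringsAsFactors = FALSE
)

sorted <- df %>%
  left_join(lookup, by = "col2") %>%  # attach the rank to every row
  arrange(col3, rank) %>%             # sort by set first, then by the desired order
  select(-rank)                       # drop the helper column again

sorted
```

Because the order is carried by the rank column, the result does not change if the join returns its rows in a different order.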
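You can use forcats::fct_relevel. Note that, as written, this sorts the whole data frame by col2; add col3 as the first argument to arrange() if each set should stay together: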
df %>%
arrange(forcats::fct_relevel(col2, order_vector))
# col1 col2 col3
# 1 3 CHL spec1
# 2 8 CHL spec2
# 3 4 GEN spec1
# 4 9 GEN spec2
# 5 1 COL spec1
# 6 6 COL spec2
# 7 2 CIP spec1
# 8 7 CIP spec2
# 9 5 TMP spec1
# 10 10 TMP spec2
An option that does not turn col2 into a factor is to add a group_by statement before your match call:
library(dplyr)
col1 <- 1:10
col2 <- rep(c("COL","CIP","CHL","GEN","TMP"), 2)
col3 <- rep(c("spec1", "spec2"), each = 5)
df <- data.frame(col1, col2, col3, stringsAsFactors = F)
order_vector <- c("CHL","GEN","COL","CIP","TMP")
df <- df %>%
group_by(col3) %>%
slice(match(order_vector, col2))
df
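# A tibble: 10 x 3
# Groups:   col3 [2]
    col1 col2  col3
   <int> <chr> <chr>
 1     3 CHL   spec1
 2     4 GEN   spec1
 3     1 COL   spec1
 4     2 CIP   spec1
 5     5 TMP   spec1
 6     8 CHL   spec2
 7     9 GEN   spec2
 8     6 COL   spec2
 9     7 CIP   spec2
10    10 TMP   spec2

Somehow the group_by() solution works on the example data frame but not on mine: it still keeps only the first value of col3. The only differences are that col2 contains more values and there are some extra columns, but that should not matter, should it?

Thanks, this works. Do you think it is a fast approach? My data frame is not large at the moment, but I may have more data to run this on later.

Sorting on a factor should be fast, and so should the conversion to a factor. To optimise for speed, convert col2 to a factor from the start in every table that contains CHL etc., and then just call df %>% arrange(col3, col2) whenever you need to sort. Most likely both ways will be fast.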
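One thing to keep in mind with slice(match(order_vector, col2)) is that match() returns only the first position of each value, so within a group at most one row per entry of order_vector is kept. A minimal sketch of an alternative that keeps every row is to sort on the position of col2 in order_vector instead. The data below is made up for illustration (CHL appears twice per set), and sliced and sorted are hypothetical names:

```r
library(dplyr)

# Illustrative data: "CHL" is repeated inside each set
df <- data.frame(
  col1 = 1:12,
  col2 = rep(c("COL", "CIP", "CHL", "GEN", "TMP", "CHL"), 2),
  col3 = rep(c("spec1", "spec2"), each = 6),
  stringsAsFactors = FALSE
)
order_vector <- c("CHL", "GEN", "COL", "CIP", "TMP")

# Grouped slice(): picks one row per entry of order_vector in each group,
# so the second "CHL" row of each set is dropped
sliced <- df %>%
  group_by(col3) %>%
  slice(match(order_vector, col2)) %>%
  ungroup()

# arrange() on the position in order_vector: every row is kept;
# values missing from order_vector would get NA and be sorted last
sorted <- df %>%
  arrange(col3, match(col2, order_vector))

nrow(sliced)
nrow(sorted)
```

This needs neither a factor conversion nor group_by(), since col3 is simply the first sort key.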