An efficient way to filter low-frequency data from a data frame in R
I have a data.frame with several columns and want to filter out low-frequency data based on combinations of variables. For example, a sex variable taking Male/Female and a cholesterol variable taking High/Low (the example below uses a column named Age in that role). My data frame would then look like this:
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
index Sex Age
1 1 Male High
2 2 Female High
3 3 Male High
4 4 Female High
5 5 Female High
6 6 Male High
7 7 Female High
8 8 Female High
9 9 Female Low
10 10 Male Low
11 11 Female High
12 12 Male High
13 13 Female High
14 14 Female High
15 15 Male Low
16 16 Female Low
17 17 Male High
18 18 Male Low
19 19 Male Low
20 20 Female Low
Now I want to keep only the Sex/Age combinations with a frequency greater than 3:
table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4
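In other words, I want to keep the indices of the Female-High, Male-Low, and Male-High rows.
Note that 1) my real data frame has more variables than the example above, 2) I do not want to use any third-party R packages, and 3) I want it to be fast.
OK, here is a base R option: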
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
merge(
df
, aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
, by = c("Sex", "Age")
)
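Note that the merge above only attaches the per-combination count to each row (aggregate names that column x by default); the rows still have to be filtered. A minimal sketch of that last step, assuming the threshold of 3 from the question and a deterministic stand-in for df whose counts match the table in the question. The names counts, merged, and kept are illustrative only.

```r
# Deterministic stand-in for df with the same Sex/Age counts as in the question
# (Female-High 8, Female-Low 3, Male-High 5, Male-Low 4).
df <- data.frame(
  Sex = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(8, 3, 5, 4))
)
df$index <- seq_len(nrow(df))

# Count rows per combination; aggregate() names the summed column "x".
counts <- aggregate(rep(1, nrow(df)), by = df[, c("Sex", "Age")], sum)

# Attach the count to every row, then keep combinations seen more than 3 times.
merged <- merge(df, counts, by = c("Sex", "Age"))
kept <- merged[merged$x > 3, c("index", "Sex", "Age")]

# merge() reorders rows, so restore the original order by index.
kept <- kept[order(kept$index), ]
```

With this stand-in data, only the three Female-Low rows are dropped.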
The aggregate function sums all the 1s for each combination.
Here is a simple base R approach:
lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
# index Sex Age
#1 1 Male High
#2 2 Female High
#3 3 Male High
#4 4 Female High
#5 5 Female High
#6 6 Male High
#7 7 Female High
#8 8 Female High
#10 10 Male Low
#11 11 Female High
#12 12 Male High
#13 13 Female High
#14 14 Female High
#15 15 Male Low
#17 17 Male High
#18 18 Male Low
#19 19 Male Low
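Since the real data has more than two grouping variables, the same idea can be written so that the columns are passed by name. This is only a sketch: the helper filter_by_freq and its arguments vars and min_count are hypothetical names, not part of the answer above, and the example data is a deterministic stand-in with the counts from the question.

```r
# Hypothetical helper: keep rows whose combination of `vars` occurs
# more than `min_count` times. Uses only base R.
filter_by_freq <- function(data, vars, min_count = 3) {
  # One factor level per observed combination of the grouping columns.
  grp <- interaction(data[vars], drop = TRUE)
  codes <- as.integer(grp)
  # Count per level, then map each row to the count of its own level.
  n <- tabulate(codes, nbins = nlevels(grp))[codes]
  # Rows with NA in a grouping column get NA counts and are dropped.
  data[!is.na(n) & n > min_count, , drop = FALSE]
}

# Stand-in data: Female-High 8, Female-Low 3, Male-High 5, Male-Low 4.
df <- data.frame(
  index = 1:20,
  Sex = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(8, 3, 5, 4))
)

res <- filter_by_freq(df, c("Sex", "Age"))
```

Adding more grouping columns only changes the vars argument, for example filter_by_freq(df, c("Sex", "Age", "Other")) for a hypothetical third column named Other.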
A dplyr answer is:
library(dplyr)
df %>%
group_by(Sex, Age) %>%
filter(n() > 3)
Despite what the OP specified, this is not a base R solution. I think it may be useful to future users who do not have those restrictions.
Another base R option, using table and merge:
vars <- c("Sex","Age")
max_freq <- 3
new_df <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])
new_df
# Sex Age index
# 1 Female High 2
# 2 Female High 7
# 3 Female High 14
# 4 Female High 11
# 5 Female High 5
# 6 Female High 4
# 7 Female High 13
# 8 Female High 8
# 9 Male High 6
# 10 Male High 3
# 11 Male High 1
# 12 Male High 17
# 13 Male High 12
# 14 Male Low 10
# 15 Male Low 15
# 16 Male Low 18
# 17 Male Low 19
We can also do this with data.table, which should be efficient as well:
library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]
Or using .I:
setDT(df)[df[, .I[.N >3], .(Sex, Age)]$V1]
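Comments:
Is there a good reason why you only want to use base R? Otherwise, I have a neat and elegant suggestion for you: df %>% group_by(Sex, Age) %>% mutate(occurrences = n())
I use multi-core functions, which makes it hard to pass third-party packages to the worker processes.
Is it in R? Which R package is it?
Oh, dplyr is a problematic package; I have already tried it, so I posted a base R answer below.
An addendum: you said you want it to be fast. If that really matters, you should think twice. dplyr is faster, and if you really need even more speed, data.table is the package of choice.
You can use df %>% group_by(Sex, Age) %>% filter(n() > 3) to do the subsetting without using df$x.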
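Because the question asks for speed and the thread does not include timings, here is a rough way to compare the two base R approaches on your own machine. This is a sketch under stated assumptions: the synthetic data frame big, its column names a, b, c, its size, and the wrapper names by_table and by_merge are all made up for illustration, and the results will vary with data shape and R version.

```r
# Synthetic data: 5e4 rows and three character grouping columns
# (sizes and column names are arbitrary assumptions).
set.seed(1)
n <- 5e4
big <- data.frame(
  index = seq_len(n),
  a = sample(letters, n, replace = TRUE),
  b = sample(LETTERS, n, replace = TRUE),
  c = sample(as.character(1:20), n, replace = TRUE)
)
vars <- c("a", "b", "c")
threshold <- 3

# Approach 1: interaction() + table(), as in the base R answer above.
by_table <- function(d) {
  lvls <- interaction(d[vars], drop = TRUE)
  counts <- table(lvls)
  d[lvls %in% names(counts)[counts > threshold], ]
}

# Approach 2: table() + merge(), as in the other base R answer above.
by_merge <- function(d) {
  freq <- subset(as.data.frame(table(d[, vars])), Freq > threshold)[vars]
  merge(d, freq)
}

t1 <- system.time(r1 <- by_table(big))
t2 <- system.time(r2 <- by_merge(big))

# Elapsed seconds for each approach.
print(rbind(interaction_table = t1, table_merge = t2)[, "elapsed", drop = FALSE])
```

Both functions should keep the same set of rows; merge returns them in a different order, so compare sorted index values rather than the data frames directly.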