An efficient way to filter low-frequency data from a data frame in R


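I have a data.frame with several columns, and I want to filter out low-frequency data based on combinations of variables. For example, a Sex variable is Male/Female and a second variable (called Age in the code below) is Low/High. My data frame then looks like this: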

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low
Now I want to keep only the Sex/Age combinations whose frequency is greater than 3:

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4
In other words, I want to keep the rows for Female-High, Male-Low, and Male-High.


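Note that 1) my real data frame has more variables than the example above, 2) I do not want to use any third-party R packages, and 3) I want it to be fast.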

OK, here is a base R option:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df

merge(
    df
    , aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
    , by = c("Sex", "Age")
)

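The aggregate call sums the 1s within every Sex/Age combination, which gives the number of rows in each group; merge then attaches that count (in a column that aggregate names x) to every row of df. To finish the filter, keep only the merged rows whose count is greater than 3.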
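The filtering step that follows the merge can be sketched like this. It is a minimal illustration, not the answerer's own code: the hand-built df (with the same group counts as the table in the question) and the names counts, merged, and result are assumptions made for the example.

```r
# Hand-built stand-in for the question's data: 8 Female/High, 3 Female/Low,
# 5 Male/High, 4 Male/Low (the same counts as the table in the question).
df <- data.frame(
  index = 1:20,
  Sex   = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age   = rep(c("High", "Low", "High", "Low"),       times = c(8, 3, 5, 4))
)

# Count rows per Sex/Age combination; aggregate() names the count column "x".
counts <- aggregate(rep(1, nrow(df)), by = df[, c("Sex", "Age")], FUN = sum)

# Attach the count to every row, then keep rows from groups with more than 3.
merged <- merge(df, counts, by = c("Sex", "Age"))
result <- merged[merged$x > 3, c("index", "Sex", "Age")]

# merge() does not preserve row order, so restore it by index.
result <- result[order(result$index), ]
```

Here result holds the 17 rows from the three frequent groups; the Female/Low rows (indices 9 to 11 in this stand-in data) are dropped.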

Here is a simple base R approach:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low
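Since the question says the real data frame has more than two variables, the same idea can be wrapped so that the grouping columns are passed by name. This is only a sketch under that assumption: the function filter_by_freq, its arguments, and the extra Smoker column are hypothetical and do not come from the original answer.

```r
# Keep rows whose combination of `vars` occurs more than `min_count` times.
filter_by_freq <- function(data, vars, min_count = 3) {
  # One factor level per observed combination (drop = TRUE skips empty ones).
  grp <- interaction(data[vars], drop = TRUE)
  # Group size for every row: count each level, then index back by row.
  n_per_row <- tabulate(grp, nbins = nlevels(grp))[as.integer(grp)]
  # which() drops rows whose group is NA (a missing value in any of `vars`).
  data[which(n_per_row > min_count), , drop = FALSE]
}

# Hand-built example data with a third, made-up grouping column.
df <- data.frame(
  index  = 1:20,
  Sex    = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age    = rep(c("High", "Low", "High", "Low"),       times = c(8, 3, 5, 4)),
  Smoker = rep(c("Yes", "No"), times = 10)
)

two_vars   <- filter_by_freq(df, c("Sex", "Age"))
three_vars <- filter_by_freq(df, c("Sex", "Age", "Smoker"))
```

Using tabulate on the integer codes avoids matching level names with %in%, which may help on large inputs, though that has not been benchmarked here.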

A dplyr answer would be:

library(dplyr)
df %>% 
  group_by(Sex, Age) %>% 
  filter(n() > 3) 

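This is not a base R solution, despite what the OP asked for. I think it may still be useful to future readers who do not have that restriction.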

Another base R approach uses table and merge:

vars     <- c("Sex","Age")
max_freq <- 3
new_df   <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])

new_df
#       Sex  Age index
# 1  Female High     2
# 2  Female High     7
# 3  Female High    14
# 4  Female High    11
# 5  Female High     5
# 6  Female High     4
# 7  Female High    13
# 8  Female High     8
# 9    Male High     6
# 10   Male High     3
# 11   Male High     1
# 12   Male High    17
# 13   Male High    12
# 14   Male  Low    10
# 15   Male  Low    15
# 16   Male  Low    18
# 17   Male  Low    19

We can do this with data.table, which should also be efficient:

library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]

Or, using .I:

setDT(df)[df[, .I[.N > 3], .(Sex, Age)]$V1]
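For a data frame with many grouping columns, the data.table call can take the column names as a character vector. The following is a sketch that assumes the data.table package is installed; the hand-built df and the names vars, dt, and kept are illustrative, not part of the original answer.

```r
library(data.table)

# Hand-built stand-in for the question's data (same group counts).
df <- data.frame(
  index = 1:20,
  Sex   = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age   = rep(c("High", "Low", "High", "Low"),       times = c(8, 3, 5, 4))
)

# Grouping columns given by name, so more can be added without other changes.
vars <- c("Sex", "Age")

# as.data.table() makes a copy, leaving df itself a plain data.frame
# (setDT() would convert it in place instead).
dt <- as.data.table(df)

# For each group, return its rows only if the group has more than 3 of them.
kept <- dt[, if (.N > 3) .SD, by = vars]
```

The grouping columns come first in kept, followed by the remaining columns.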

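Is there a good reason you want to use base R only? Otherwise, I have a nice and elegant solution for you.

I use multicore functions, which makes it hard to pass third-party packages to the worker processes.

df %>% group_by(Sex, Age) %>% mutate(occurrences = n())

Is that in R? Which R package is it from?

Oh, dplyr is a problematic package for me; I have already tried it.

So I posted a base R answer below. One addendum: you say you want it to be fast. If that really matters, you should think twice. dplyr is faster, and if you really need more speed, data.table is your go-to package.

You can use df %>% group_by(Sex, Age) %>% filter(n() > 3) to do the subsetting without needing df$x.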