R: Dynamic sample based on category


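I have a data frame with two columns, ID and category. From this data frame I draw 3 samples from each category, and from that sample I draw a subsample of 6 rows overall, which gives a data frame with 6 rows.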

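library(dplyr)
set.seed(123)
df <- data.frame(ID = 1:100, category = sample(LETTERS[1:10], 100, replace = TRUE))
sub_df <- df %>% group_by(category) %>% sample_n(3) %>% ungroup() %>% sample_n(6)
sub_df
# # A tibble: 6 x 2
#      ID category
#   <int> <fct>
# 1    72 G
# 2    88 I
# 3    24 J
# 4    33 G
# 5    86 E
# 6    27 F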

I want to sample again from the original data frame df. This time, however, n depends on the count of each category in the sub_df data frame.

sub_df %>% count(category)
# # A tibble: 5 x 2
#   category     n
#   <fct>    <int>
# 1 E            1
# 2 F            1
# 3 G            2
# 4 I            1
# 5 J            1

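For every group that is not represented in sub_df, I want to draw 3 samples from df, as above. For the categories that do appear in sub_df, I want to draw only 3 - n samples, so that combining the resulting data frame with sub_df gives 3 samples in total for every category. In this example, E, F, I, and J would each get 2 new samples, while G needs only 1.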

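I suppose I could loop over each category and draw a sample whose size depends on that category's count in sub_df. However, because the number of categories becomes very large, such a loop could take a long time. I am hoping there is a neater way to do this.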

The resulting counts would look like this:

result_df %>% count(category)
# # A tibble: 10 x 2
#    category     n
#    <fct>    <int>
#  1 A            3
#  2 B            3
#  3 C            3
#  4 D            3
#  5 E            2
#  6 F            2
#  7 G            1
#  8 H            3
#  9 I            2
# 10 J            2

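A solution using dplyr and purrr. The idea is to first create a data frame, df_s2, that holds the new sample size for each category, then split df by category into df_list, and finally apply the sample_n function over df_list and the numbers in df_s2.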

library(dplyr)
set.seed(123)
df <- data.frame(ID = 1:100, category = sample(LETTERS[1:10], 100, replace = TRUE))
sub_df <- df %>% group_by(category) %>% sample_n(3) %>% ungroup() %>% sample_n(6)

library(purrr)

# Create a table that stores the sample size per category, defaulting to 3
# (tibble() replaces the deprecated data_frame())
df_s <- tibble(category = unique(df$category),
               Number = 3)

# Subtract the counts already present in sub_df
df_s2 <- df_s %>%
  left_join(sub_df %>% count(category), by = "category") %>%
  mutate(n = ifelse(is.na(n), 0, n)) %>%
  mutate(Number = Number - n) %>%
  select(-n) %>%
  arrange(category)

# Split the df by category
df_list <- split(df, f = df$category)

# Apply the sample function on df_list based on df_s2
result_df <- map2_dfr(df_list, df_s2$Number, ~sample_n(.x, .y))

# Check the count number of result_df
result_df %>% count(category)
# # A tibble: 10 x 2
#    category     n
#    <fct>    <int>
#  1 A            3
#  2 B            3
#  3 C            3
#  4 D            3
#  5 E            2
#  6 F            2
#  7 G            1
#  8 H            3
#  9 I            2
# 10 J            2
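As a possible variation, the same result can be sketched in a single dplyr pipeline without splitting df into a list. This is not the answer's method but an alternative under stated assumptions: the toy data is balanced (10 rows per category, an assumption made so every draw is possible), and the column names n_sub and needed are hypothetical. Within each group, sample(n()) assigns every row a random rank, and only rows whose rank does not exceed the required size are kept.

```r
library(dplyr)

set.seed(123)
# Assumption: balanced toy data, 10 rows per category
df <- data.frame(ID = 1:100,
                 category = sample(rep(LETTERS[1:10], each = 10)),
                 stringsAsFactors = FALSE)
sub_df <- df %>% group_by(category) %>% sample_n(3) %>% ungroup() %>% sample_n(6)

result_df <- df %>%
  # Attach the per-category count from sub_df (NA when the category is absent)
  left_join(count(sub_df, category, name = "n_sub"), by = "category") %>%
  # Required sample size: 3 minus what sub_df already holds
  mutate(needed = 3L - coalesce(n_sub, 0L)) %>%
  group_by(category) %>%
  # Random rank per row within the group; keep the first `needed` ranks
  filter(sample(n()) <= needed) %>%
  ungroup() %>%
  select(ID, category)

# Combined with sub_df, every category should now have 3 rows
bind_rows(sub_df, result_df) %>% count(category)
```

Because the filtering happens inside one grouped operation, no explicit loop or list of data frames is needed, which may be convenient when the number of categories is large.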
