How do I perform an operation on a specific subset of a data.frame in R?

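(I have a feeling I'll feel silly once I get the answer, but I just can't figure it out.)

I have a data.frame with an empty column at the end. It will mostly be filled with NAs, but I want to fill some of its rows with a value. This column represents a guess at data that is missing from another column of the data.frame.

My initial data.frame looks like this: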

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        |median(df$MaxPlayers[df$MinPlayers ==3,])
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        |median(df$MaxPlayers[df$MinPlayers ==2,])
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |
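Note that two of the rows have "N/A" for MaxPlayers. What I'm trying to do is use the information I do have to guess what MaxPlayers might be. If median(MaxPlayers) for 3-player games is 6, then MaxPlayersGuess should equal 6 for games where MinPlayers == 3 and MaxPlayers == N/A. (I've tried to indicate in code, in the example above, the value MaxPlayersGuess should get.)

The resulting data.frame would look like this: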

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        |6
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        |4
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |
To share the result of one attempt:

gld$MaxPlayersGuess <- ifelse(is.na(gld$MaxPlayers), median(gld$MaxPlayers[gld$MinPlayers,]), NA)


Error in gld$MaxPlayers[gld$MinPlayers, ] : 
incorrect number of dimensions
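
The error appears to come from the trailing comma inside the brackets: `gld$MaxPlayers` is a plain vector, so it takes a single index, whereas the `[rows, ]` form is two-dimensional and belongs on the data.frame itself. A minimal sketch of the difference, using a small made-up `gld` (the values are assumptions for illustration, not the asker's real data):

```r
# Hypothetical stand-in for the asker's gld data.frame
gld <- data.frame(
  MinPlayers = c(3, 3, 3, 2, 2),
  MaxPlayers = c(6, 7, NA, 4, 5)
)

# A column pulled out with $ is a vector: one logical index, no comma
three_player_max <- gld$MaxPlayers[gld$MinPlayers == 3]

# na.rm = TRUE drops the missing value before the median is taken
median_three <- median(three_player_max, na.rm = TRUE)
print(median_three)

# The two-dimensional form is applied to the data.frame, not to the column
same_values <- gld[gld$MinPlayers == 3, "MaxPlayers"]
```

Note also that indexing with `gld$MinPlayers` itself, rather than with a comparison such as `gld$MinPlayers == 3`, would select elements by position instead of by group.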

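Updated relative to the posted example.

Here's my tip of the day: sometimes it's easier to compute what you want first and then grab it when you need it than to use all of these logical connectives. You're trying to come up with a way to compute everything at once, which gets confusing, so break it into steps. You need to know the median "MaxPlayer" for each possible "MinPlayer" group. Then you want to use that value wherever MaxPlayer is missing. So here's a simple way to do it: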

#generate fake data 
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)

df <- data.frame(MinPlayer, MaxPlayer)

#replace some values of MaxPlayer with NA
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)

####STARTING DATA
# > df
# MinPlayer MaxPlayer
# 1          3         2
# 2          3         2
# 3          3        NA
# 4          3        NA
# 5          2         4
# 6          2         4
# 7          2         5
# 8          2         5
# 9          3         2
# 10         3         2
# 11         3        NA
# 12         3        NA
# 13         2         4
# 14         2         4
# 15         2         5
# 16         2         5

####STEP 1
#find the median of MaxPlayer for each group of MinPlayer (e.g., when MinPlayer == 1, 2 or whatever)
#just add a column to the data frame that has the right median value for each subset of MinPlayer in it and grab that value to use later. 
library(plyr) #plyr is a great way to compute things across data subsets
df <- ddply(df, c("MinPlayer"), transform, 
            median.minp = median(MaxPlayer, na.rm = TRUE)) #ignore NAs in the median

####STEP 2
#anytime that MaxPlayer == NA, grab the median value to replace the NA, otherwise keep the MaxPlayer value
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer)

####STEP 3
#you had to compute an extra column you don't really want, so drop it now that you're done with it
df <- df[ , !(names(df) %in% "median.minp")]

####RESULT
# > df
# MinPlayer MaxPlayer
# 1          2         4
# 2          2         4
# 3          2         5
# 4          2         5
# 5          2         4
# 6          2         4
# 7          2         5
# 8          2         5
# 9          3         2
# 10         3         2
# 11         3         2
# 12         3         2
# 13         3         2
# 14         3         2
# 15         3         2
# 16         3         2
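
The same compute-first, fill-second idea can also be sketched in base R, assuming `ave()` is an acceptable stand-in for the `ddply` step. This is an alternative to the code above rather than what the answer uses; unlike `ddply`, it keeps the rows in their original order and needs no helper column.

```r
# Same fake data as in the answer above
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)
df <- data.frame(MinPlayer, MaxPlayer)
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)

# ave() returns a vector as long as its input, holding each row's group median
group_median <- ave(df$MaxPlayer, df$MinPlayer,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing entries with the median of their own group
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), group_median, df$MaxPlayer)
```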

I think you've already found everything you need in @griffmer's answer. But a less elegant, though perhaps more intuitive, approach might be a loop:

## Your data:
df <- data.frame(
        Game = LETTERS[1:11],
        Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
        MinPlayers = c(rep(3,5), rep(2,6)),
        MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)     
)

## Loop over rows:
df$MaxPlayersGuess <- vapply(1:nrow(df), function(ii){
            if (is.na(df$MaxPlayers[ii])){
                median(df$MaxPlayers[df$MinPlayers == df$MinPlayers[ii]],
                        na.rm = TRUE)               
            } else {
                df$MaxPlayers[ii]
            }           
        }, numeric(1))
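
The row-wise lookup above can also be sketched in vectorised form, which speaks to the question of how to reach the MinPlayers value on the same row. This variant leaves MaxPlayersGuess as NA wherever MaxPlayers is already known, as in the asker's target table. The names `medians` and `missing` are introduced here purely for illustration.

```r
df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# One median per MinPlayers group, named by the group value ("2" and "3")
medians <- tapply(df$MaxPlayers, df$MinPlayers, median, na.rm = TRUE)

# Rows that need a guess
missing <- is.na(df$MaxPlayers)

# Start with all NA, then look up each missing row's own MinPlayers in the table
df$MaxPlayersGuess <- NA_real_
df$MaxPlayersGuess[missing] <-
  as.vector(medians[as.character(df$MinPlayers[missing])])
```

Indexing the table by `as.character(df$MinPlayers[missing])` is what ties each guess to the group of its own row.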

If you want to use dplyr, you can try:

Input:

df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
  MinPlayers = c(rep(3,5), rep(2,6)),
  MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)     
)
This groups the data by MinPlayers and then assigns the median of MaxPlayers to the rows with missing data.
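
The dplyr call itself is missing from the post as captured here. A sketch of a pipeline that fits the description and the printed output, assuming dplyr is installed, might look like the following; it is a reconstruction and not necessarily the answerer's exact code.

```r
library(dplyr)

df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# Within each MinPlayers group, replace missing MaxPlayers with the group median
result <- df %>%
  group_by(MinPlayers) %>%
  mutate(MaxPlayers = ifelse(is.na(MaxPlayers),
                             median(MaxPlayers, na.rm = TRUE),
                             MaxPlayers))
```

The result stays grouped by MinPlayers; adding `ungroup()` at the end would return an ungrouped table if that is preferred.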

Output:

Source: local data frame [11 x 4]
Groups: MinPlayers [2]

     Game Rating MinPlayers MaxPlayers
   <fctr>  <dbl>      <dbl>      <dbl>
1       A    6.0          3          6
2       B    7.0          3          7
3       C    6.5          3          6
4       D    7.0          3          6
5       E    7.0          3          5
6       F    9.5          2          5
7       G    6.0          2          4
8       H    7.0          2          4
9       I    6.5          2          4
10      J    7.0          2          2
11      K    7.0          2          4

Apologies: since I didn't even know how to begin writing the program, I didn't know how to provide a reproducible example. Thanks for trying to answer. By trying some of your suggestions I was able to see the problem better and work out how to post an example.

@Zelbinian, so normally you would mark griffmer's as the answer.

@Chris But I still don't know how to solve this... All I've learned is how to state it more accurately. The key difference is that the value the calculation is based on depends on the MinPlayers value in the same row that is currently being operated on, and I don't know how to access that value.

@Zelbinian, is.na() is the test you want for determining where to put the max players guess. So: max_play_idx