R: How can I remove specific rows by condition when the data are grouped by class?


I have a data frame:

structure(list(group = structure(c(2L, 2L, 2L, 3L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 
1L, 3L, 3L, 2L, 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 
3L, 1L, 3L, 1L, 2L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 
1L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("GAD", 
"NAC", "SP"), class = "factor"), age = c(22, 37, 19, 59, 45, 
54, 19, 21, 19, 21, 25, 18, 18, 19, 20, 18, 19, 20, 19, 22, 28, 
19, 65, 20, 18, 19, 18, 18, 56, 25, 18, 27, 20, 27, 18, 55, 19, 
26, 18, 46, 62, 25, 19, 18, 19, 23, 28, 20, 29, 18, 37, 18, 46, 
18, 23, 26, 49, 59, 43, 20, 46, 35, 18, 54, 25, 48, 26, 27, 27, 
43, 29, 41, 43, 29, 19, 18, 19, 30, 27, 44, 46, 65, 36, 29, 38, 
26), worry = c(17, 18, 18, 22, 23, 23, 23, 24, 25, 27, 27, 28, 
29, 30, 30, 31, 32, 34, 34, 36, 37, 40, 42, 42, 43, 44, 45, 45, 
46, 46, 47, 48, 49, 50, 50, 53, 53, 55, 55, 56, 56, 56, 56, 57, 
59, 60, 60, 60, 61, 61, 61, 61, 61, 61, 62, 64, 66, 67, 67, 67, 
68, 68, 68, 69, 69, 70, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 
75, 76, 76, 76, 76, 78, 80, 80, 80, 80), incor_Cz = c(0.905655679, 
-5.972279231, -0.441464378, -7.768101371, -0.068112561, -5.9488735, 
4.917631564, 3.560398459, 3.62044852, 3.208378382, 6.383463977, 
3.101797215, 2.928925966, 10.92697216, 9.674200152, -0.430347693, 
5.768622107, 4.361622622, 3.814244831, 10.6478174, 4.621914209, 
4.015470126, -2.990363994, 10.28108226, 4.330419384, 4.777957595, 
-2.351932712, -0.86237015, -3.487416819, -5.954685457, 0.082161102, 
2.69205892, -2.195755315, 10.44202624, 1.727674592, 4.310826532, 
8.370135468, 9.529998174, 11.84098752, 2.449555383, -5.489426436, 
6.802779597, 0.217815002, 10.06140598, 2.626799279, -3.593214611, 
-2.486217625, -11.32397897, 7.154051703, 6.901286517, 3.504033222, 
-6.316759194, 10.70866173, -8.972840718, 4.533894362, -11.77410765, 
0.236432185, -3.721355061, -0.440954973, -15.3296636, -0.320463156, 
-7.644082526, 5.732567823, -0.659948993, 5.331566103, -1.161087095, 
4.699510759, 5.038408832, -3.100193429, 0.712125907, 10.28751091, 
-0.926246126, 8.789326896, -2.642870899, 1.412052899, 1.266241584, 
9.31459946, -0.827073637, 0.302046533, -1.002243048, -3.36313534, 
3.96444658, -1.022874301, 14.25621138, -1.30046704, 2.30875538
), corr_Cz = c(6.483764554, 0.17135543, 6.839731626, 3.502085263, 
5.464570162, -3.898580751, 8.486522854, 5.193051225, -1.077336305, 
2.253276067, 6.734594272, 1.008001519, 2.752022253, 10.15283381, 
10.67605329, 0.054572416, 3.298597911, 12.50543853, 9.012508794, 
9.900038662, 6.509256106, 2.953717593, 2.437522863, 11.26964708, 
5.085908835, 5.054000349, -0.376062125, 1.992393525, 6.489963996, 
6.411416639, -0.65324494, -0.572531358, -3.488881215, 10.5146121, 
8.979631825, 5.883346362, 8.835913808, 9.126806683, 13.09475723, 
0.469198649, 1.605589433, 7.74512423, 1.330835368, 8.015422928, 
6.225187747, 0.008224673, 2.714404145, 1.245554826, 2.277742942, 
1.753820412, 5.114288415, 0.285880059, 10.42432614, -2.280815921, 
2.527486235, -6.767570127, 3.347916611, 3.135211125, -1.282160871, 
-2.483906663, 10.96091046, -0.026853122, 9.81999986, -0.541655651, 
7.566954252, 1.971577596, 3.272944482, 9.747471161, 12.14564621, 
5.960042605, 7.480088326, 8.952888624, 6.302918576, -0.881073076, 
3.246495941, 9.763856362, 1.720188523, 3.033841316, 12.46009515, 
2.589991797, 3.187351241, -3.483036943, 3.088361102, 4.390436546, 
0.046362569, 2.779881841)), row.names = c(1L, 2L, 3L, 4L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 16L, 17L, 18L, 19L, 21L, 
22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 32L, 33L, 34L, 35L, 36L, 
37L, 38L, 39L, 40L, 42L, 43L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 
54L, 55L, 56L, 57L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 68L, 
69L, 70L, 71L, 72L, 73L, 74L, 76L, 77L, 78L, 80L, 81L, 82L, 83L, 
85L, 86L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 97L, 99L, 102L, 
103L, 104L, 105L), class = "data.frame")
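As you can see, there are three different groups, and the data are split by these three groups. I want to remove the rows that deviate from the mean by 3 or more standard deviations. I only care about deviations of 3 or more standard deviations in incor_Cz and corr_Cz; age and worry should be ignored. I wrote functions to do this after computing the standard deviation and the mean. When I run them after using the by() function, I get a data frame with no data, instead of nothing being removed when no value is greater than 3. Here is my work: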

remove_rows_corr <- function(x, na.rm = TRUE) {
  x <- x[!(x >= 3),]
  return(x)
}

remove_rows_incorr <- function(x, na.rm = TRUE) {
  x <- x[!(x >= 3),]
  return(x)
}

sd_incorr <- sd(data$incor_Cz)
average_incorr <- mean(data$incor_Cz)

sd_corr <- sd(data$corr_Cz)
average_corr <- mean(data$corr_Cz)

dflist <- by(data, data$group, function(data){
  data$standard_deviations_incorr <-  lapply((data$incor_Cz-average_incorr)/sd_incorr, FUN = abs)
  return(data)
})

data <- do.call(rbind, dflist)

data <- as.data.frame(lapply(data$standard_deviations_incorr, FUN = remove_rows_incorr))
Then I split the data frame and do these calculations separately:

GAD_only <- data[data$group == 'GAD',]

GAD_only$standard_deviations_corr <- lapply((GAD_only$corr_Cz-3.498088)/4.033308, FUN = abs)

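GAD_only

I found a way to do this without the functions I created. I was overthinking it.

Instead of using functions, I apply the same logic directly after splitting the data by group and computing the standard deviations: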

data <- data[!(data$standard_deviations_incorr >= 3),]

This removes every row whose value in the standard_deviations_incorr column is 3 or greater.
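The same direct subsetting can be combined with per-group statistics in base R. The following is only a sketch on hypothetical toy data (the `toy` data frame, the `abs_z` helper, and the `z_group` / `z_global` columns are illustrative names, not part of the original post). It uses `ave()` to compute the absolute z-score within each group, and compares it with the score based on the overall mean and sd.

```r
# Hypothetical toy data: two groups on very different scales,
# with one planted outlier (50) in group "A".
toy <- data.frame(
  group    = rep(c("A", "B"), each = 20),
  incor_Cz = c((-9:9) / 10, 50,        # group A: 19 small values + outlier
               1000 + (1:20) / 2)      # group B: 20 values near 1000
)

# Absolute z-score of each value relative to the vector it is given.
abs_z <- function(x) abs((x - mean(x)) / sd(x))

# ave() applies abs_z within each group and returns a vector
# aligned with the original rows.
toy$z_group <- ave(toy$incor_Cz, toy$group, FUN = abs_z)

# Same score using the overall mean and sd, for comparison.
toy$z_global <- abs_z(toy$incor_Cz)

# Keep only rows within 3 group-wise standard deviations.
cleaned <- toy[toy$z_group < 3, ]
```

In this toy example the planted value 50 is flagged by the group-wise score but not by the global one, because the large values in group "B" inflate the overall standard deviation.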

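I may not fully understand what you want to do, but here are some ideas. We can use the dplyr package and its filter function for this. Using if_all ensures that the condition is met in every column whose name ends with "_Cz". We can also define a function, temp_fun, that expresses the condition.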

library(dplyr)

# Create a function that returns TRUE for each element of a vector where
# abs((x - x_mean)/x_sd) < 3
temp_fun <- function(x, na.rm = TRUE){
  x_mean <- mean(x, na.rm = na.rm)
  x_sd <- sd(x, na.rm = na.rm)
  result <- abs((x - x_mean)/x_sd)
  
  ans <- result < 3
  return(ans)
}

# Use that function to filter the rows
# (dat is assumed to hold the data frame shown in the question)
# Use if_all because every matching column must satisfy the condition
dat2 <- dat %>%
  group_by(group) %>%
  filter(if_all(ends_with("_Cz"), .fns = temp_fun))
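To see how the grouped filter behaves when outliers are actually present, here is a small sketch on hypothetical toy data (the `toy` data frame and the `within_3sd` helper are illustrative; `within_3sd` restates the same rule as `temp_fun`). It assumes dplyr 1.0.4 or later, where `if_all()` is available.

```r
library(dplyr)

# Same rule as temp_fun: TRUE where a value lies within 3 SDs of its mean.
within_3sd <- function(x, na.rm = TRUE) {
  abs((x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)) < 3
}

# Hypothetical toy data: 20 rows per group. Group "A" has one planted
# outlier in incor_Cz (50, row 20) and one in corr_Cz (-40, row 1).
toy <- data.frame(
  group    = rep(c("A", "B"), each = 20),
  incor_Cz = c((-9:9) / 10, 50, (1:20) / 2),
  corr_Cz  = c(-40, (-9:9) / 10, (1:20) / 2)
)

# Inside a grouped filter(), within_3sd sees one group's values at a time,
# so the mean and sd are computed per group. if_all() keeps a row only
# when every "_Cz" column passes.
kept <- toy %>%
  group_by(group) %>%
  filter(if_all(ends_with("_Cz"), within_3sd)) %>%
  ungroup()

nrow(kept)
```

Because `if_all()` requires every selected column to pass, a row is dropped as soon as either `incor_Cz` or `corr_Cz` is 3 or more standard deviations from its group mean; in this toy data that removes two rows from group "A".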

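Why do you need to do this by group? You are calculating mean and sd outside of the by statement.

You just brought up a point that I overlooked. I do in fact need to calculate mean and sd separately for each group. Dang.

@Late Mail I think it is safe to say that I don't need the functions I made, but I do need to calculate mean and sd inside the by statement.

My calculation shows that all of your rows meet the condition you specified. Did you get the same result, or did I misunderstand something?

@www Yes, that is probably why I got either no data or an error. So I know I no longer need the functions I made. It is much easier to remove the data like this ,] without using the function. I overlooked that I need to calculate mean and sd separately for each group. I think this would work if I didn't need to calculate mean and sd by group. Essentially, I am looking for significant outliers that are 3 or more standard deviations from the mean. This keeps me from removing the many observations that R flags as outliers but that really are not.

@AcidCatfish If you don't need to identify outliers by the group column, just remove the group_by(group) line. My calculation shows that, with or without grouping by group, all 86 rows are kept and no rows are removed from the example. I am curious whether you get the same result with your solution.

No, I did not get the same result with my solution. There could be many reasons for that. For example, my actual data set is larger. But, again, when I split the data by group x

@AcidCatfish Thanks for your response. I am not entirely sure my solution works. Perhaps start with a smaller test data set in which you can tell which rows contain outliers, and remove them to confirm that your solution or mine works.

Yes, I think that is best. I think I found an approach that works, but there is a lot to it. From my full data set, I removed 3 observations after using my method (but that is wrong, because I calculated the mean over the whole data set rather than by group).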
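The point raised in the comments, that mean and sd have to be computed inside the by() call rather than before it, can be sketched as follows. This is only an illustration on a small hypothetical test data set with one known outlier (the `toy` data frame and `abs_z` helper are made-up names), along the lines of the suggestion to check the approach on data where the outliers are known in advance.

```r
# Hypothetical test data: one planted outlier (50) in group "A".
toy <- data.frame(
  group    = rep(c("A", "B"), each = 20),
  incor_Cz = c((-9:9) / 10, 50, 1000 + (1:20) / 2),
  corr_Cz  = c((-9:9) / 10, 0,  1000 + (1:20) / 2)
)

# Absolute z-score of each value relative to the vector it is given.
abs_z <- function(x) abs((x - mean(x)) / sd(x))

# by() hands each group's rows to the function, so mean() and sd()
# inside abs_z see only that group's values.
pieces <- by(toy, toy$group, function(d) {
  keep <- abs_z(d$incor_Cz) < 3 & abs_z(d$corr_Cz) < 3
  d[keep, ]
})

# Stack the per-group pieces back into one data frame.
cleaned <- do.call(rbind, pieces)

table(cleaned$group)
```

Since the planted outlier is known, it is easy to confirm that exactly that row is gone from group "A" and that group "B" is untouched.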