R: Detecting impossible data-entry errors in repeated measurements in a data frame


r dataframe


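I have to check a huge database with repeated measurements of several variables on individuals. Since I can have more than 3 million observations, I would like at least to remove those that I am sure are data-entry errors.

Continuous variables

For example, focusing on the variable Weight (see the data frame below), I know that an individual cannot lose more than 40% of its weight between one observation and the next. How can I detect the observations with a larger weight loss, such as the third observation of individual "2", whose weight drops from 30 g to 3 g?

Categorical variables

For example, take the Status of an individual. An individual can be in one of 3 states (e.g. "juvenile", "adult non-breeder" or "adult breeder"; 1, 2 and 3 respectively). I know that once an individual is an adult ("2" or "3") it cannot become a juvenile ("1") again, although a 3 --> 2 transition is possible. In this particular case, I want to detect observation 9, where individual "3" is classified as "juvenile" after being classified as "adult" in the previous observation.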

Based on your description, and considering only the "problems" you mention above, try the following:
Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- as.data.frame (cbind(Individuals, Weight, Week, Status))

library(dplyr)

df %>%
  group_by(Individuals) %>%      ## for each individual
  mutate(
    ## fractional weight loss since the previous observation (negative values mean weight gain)
    WeightReduce = 1 - Weight / dplyr::lag(Weight, default = Weight[1]),
    ## flag a loss of 40% or more, or an adult (2 or 3) recorded as juvenile (1) next;
    ## computed while still grouped so the previous Status never comes from another individual
    flag = ifelse(WeightReduce >= 0.4 |
                    (dplyr::lag(Status, default = Status[1]) %in% 2:3 & Status == 1), 1, 0)
  ) %>%
  ungroup()                      ## forget the grouping


#    Individuals Weight  Week Status WeightReduce  flag
#          (dbl)  (dbl) (dbl)  (dbl)        (dbl) (dbl)
# 1           1     10     1      1    0.0000000     0
# 2           1     14     2      2   -0.4000000     0
# 3           1     20     3      3   -0.4285714     0
# 4           2     15     1      2    0.0000000     0
# 5           2     30     2      3   -1.0000000     0
# 6           2      3     3      3    0.9000000     1
# 7           3     12     1      2    0.0000000     0
# 8           3     34     2      3   -1.8333333     0
# 9           3     30     3      1    0.1176471     1
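
Since the goal is to decide later whether to drop single observations or whole individuals, the flag column can be used in more than one way. The following is only a sketch of that idea, rebuilt on the example data so it runs on its own; the names flagged, suspect_rows, suspect_individuals and clean are my own placeholders, not part of the answer above.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

# Same two checks as above, computed per individual
flagged <- df %>%
  group_by(Individuals) %>%
  mutate(
    WeightReduce = 1 - Weight / dplyr::lag(Weight, default = Weight[1]),
    flag = as.integer(WeightReduce >= 0.4 |
                        (dplyr::lag(Status, default = Status[1]) %in% 2:3 & Status == 1))
  ) %>%
  ungroup()

# Option 1: look only at the suspicious observations
suspect_rows <- filter(flagged, flag == 1)

# Option 2: pull every row of any individual with at least one suspicious observation
suspect_individuals <- flagged %>%
  group_by(Individuals) %>%
  filter(any(flag == 1)) %>%
  ungroup()

# Option 3: keep only the observations that passed both checks
clean <- filter(flagged, flag == 0)
```

Nothing is deleted from flagged itself, so the three views can be compared before deciding what to remove.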

You can use the data.table package to compute the rate of weight change and the juvenile anomaly, then filter on these two criteria:
library(data.table)

setDT(df)[, c('continuous', 'categorical') := list(
              c(0, diff(Weight) / head(Weight, -1)),       # rate of weight change per individual
              Status == 1 & c(FALSE, diff(Status) < 0)),   # juvenile recorded after a higher status
          by = Individuals][
          continuous >= -0.4 & !categorical][]             # keep only the plausible rows

#   Individuals Weight Week Status continuous categorical
#1:           1     10    1      1  0.0000000       FALSE
#2:           1     14    2      2  0.4000000       FALSE
#3:           1     20    3      3  0.4285714       FALSE
#4:           2     15    1      2  0.0000000       FALSE
#5:           2     30    2      3  1.0000000       FALSE
#6:           3     12    1      2  0.0000000       FALSE
#7:           3     34    2      3  1.8333333       FALSE
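
If you would rather inspect the suspicious rows than drop them straight away, the same two columns can feed a marker column instead of a filter. This is a rough sketch on the example data; dt, suspect and any_suspect are names I made up for illustration.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

# Per individual: relative weight change and "juvenile after a higher status" indicator
dt[, c("continuous", "categorical") := list(
     c(0, diff(Weight) / head(Weight, -1)),
     Status == 1 & c(FALSE, diff(Status) < 0)),
   by = Individuals]

# Mark suspicious rows rather than dropping them
dt[, suspect := continuous < -0.4 | categorical]

# Mark every row of an individual that has at least one suspicious row
dt[, any_suspect := any(suspect), by = Individuals]

# Rows to review by hand
dt[suspect == TRUE]
```

Filtering on suspect removes single observations, while filtering on any_suspect removes whole individuals, so both decisions stay open.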

I hope this helps:
library(data.table)
library(zoo)
df <- data.table(df)

# used to check percentage change in weight variable
calcreduction <- function(x){
  res <- diff(x)/x[-length(x)]
  return(c(0,res))
}
# this will make it easy to get rid of values where WeightReduction < -.4

# function used to assign combination type
# you can have 11,12,13,22,23,32,33 or 21,31. The latter are "bad"
getcomb <- function(x){
  res <- rbind(c(0,0),rollapply(x,2,paste))
  return(paste(res[,1],res[,2],sep=""))
}
# this will make it easy to get rid of values where the Status change is no good

# you can just pull the new vectors and then use logic
# to decide what you want to do with these values
res <- df[,list("WeightReduction"=calcreduction(Weight),
                "StatusChange"=getcomb(Status),Weight,Week,Status),by=Individuals]

> res
   Individuals WeightReduction StatusChange Weight Week Status
1:           1       0.0000000           00     10    1      1
2:           1       0.4000000           12     14    2      2
3:           1       0.4285714           23     20    3      3
4:           2       0.0000000           00     15    1      2
5:           2       1.0000000           23     30    2      3
6:           2      -0.9000000           33      3    3      3
7:           3       0.0000000           00     12    1      2
8:           3       1.8333333           23     34    2      3
9:           3      -0.1176471           31     30    3      1
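
The "use logic" step mentioned in the comments could look roughly like the sketch below. To keep it runnable on its own, the relevant columns of res are re-entered by hand from the printout above; bad_transitions, bad_weight and bad_status are names I picked for illustration.

```r
library(data.table)

# Columns of the result table shown above, re-entered by hand
res <- data.table(
  Individuals     = rep(1:3, each = 3),
  WeightReduction = c(0, 0.4, 0.4285714, 0, 1, -0.9, 0, 1.8333333, -0.1176471),
  StatusChange    = c("00", "12", "23", "00", "23", "33", "00", "23", "31")
)

# Transitions from an adult status back to juvenile are the impossible ones
bad_transitions <- c("21", "31")

res[, bad_weight := WeightReduction < -0.4]             # lost more than 40% of the weight
res[, bad_status := StatusChange %in% bad_transitions]  # adult recorded as juvenile next

# Rows that fail either check
res[bad_weight | bad_status]
```

Keeping the two checks in separate columns makes it possible to see which rule a row broke before deciding how to handle it.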

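So, do you want to remove only the specific rows for individuals 2 and 3, or every row for those individuals?

I want to detect those rows. Later, case by case, I will decide whether to remove just those rows or every row for the individual.

I am pretty sure you can use the data.table package. I have a similar example that asked for a mu… with multiple rows, but I am not able to help you here. You will get some good answers here.

You can use dplyr, data.table, or base R. The main thing is to write down the list of problems that you know would reveal data-entry errors, then picture what each one looks like in the dataset, and finally create the appropriate filters to flag them.

It detects the "wrong" observations perfectly, but I would like the function to keep the wrong observations, so that I can check whether I have to remove just the observation or the whole individual.

In that case you can delete the last line of my code (the one that filters) and you are done ;)

It works perfectly. The only thing is that I had to replace 'dplyr::lag' with stats::lag. Yes, it works with stats::lag. If I use 'dplyr::lag' it gives the error: "'lag' is not an exported object from 'namespace:dplyr'".

Interesting. I get the opposite behaviour: it does not work with stats::lag. It may be down to the version of R or dplyr you are using. Can you try plain lag? It might work for you...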
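
On the stats::lag versus dplyr::lag exchange above, one caution: as far as I know, stats::lag() on a plain numeric vector only shifts a time-series attribute and leaves the values where they are, so the code can run without error and still never flag a weight drop. If the lag function is a source of trouble, a base R version avoids it entirely. The sketch below assumes the rows are already in time order within each individual; prev, PrevWeight and PrevStatus are hypothetical names of mine.

```r
# Example data from the question
df <- data.frame(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

# Previous value within a vector; the first element has no predecessor, so it gets NA
prev <- function(x) c(NA, head(x, -1))

# ave() applies prev() separately to each individual and keeps the row order
df$PrevWeight <- ave(df$Weight, df$Individuals, FUN = prev)
df$PrevStatus <- ave(df$Status, df$Individuals, FUN = prev)

# Fractional weight loss since the previous observation (NA for the first one)
df$WeightReduce <- 1 - df$Weight / df$PrevWeight

# Flag a loss of 40% or more, or an adult (2 or 3) recorded as juvenile (1) next
df$flag <- as.integer(
  (!is.na(df$WeightReduce) & df$WeightReduce >= 0.4) |
    (!is.na(df$PrevStatus) & df$PrevStatus %in% 2:3 & df$Status == 1)
)

# Check of the caution above: the values returned by stats::lag() are not shifted
identical(as.numeric(stats::lag(df$Weight)), df$Weight)
```

Using NA for the first observation, rather than a default equal to the first value, makes it explicit that there is nothing to compare against.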