Identifying specific differences between two data sets in R


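I want to compare two data sets and identify the specific instances of differences between them (i.e., which variables differ).

Although I have found out how to identify which records are not identical between the two data sets (using the function detailed here:), I am not sure how to flag which variables differ.

For example: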

Data set A:

id      name        dob       vaccinedate  vaccinename  dose
100000  John Doe    1/1/2000  5/20/2012    MMR          4
100001  Jane Doe    7/3/2011  3/14/2013    VARICELLA    1

Data set B:

id      name        dob       vaccinedate  vaccinename  dose
100000  John Doe    1/1/2000  5/20/2012    MMR          3
100001  Jane Doee   7/3/2011  3/24/2013    VARICELLA    1
100002  John Smith  2/5/2010  7/13/2013    HEPB         3

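I want to identify which records differ and which specific variables are discrepant. For example, the John Doe record has 1 discrepancy, in dose, whereas the Jane Doe record has 2 discrepancies, in name and vaccinedate. In addition, data set B has one extra record that is not in data set A, and I want to identify such instances as well.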

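Ultimately, the goal is to find the frequency of each "type" of error, e.g., how many records have a discrepancy in vaccinedate, vaccinename, dose, and so on.

Thanks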

One possibility. First, find the IDs that the two data sets have in common. The simplest way to do this is:

commonID<-intersect(A$id,B$id)
Next, you can restrict both data sets to the IDs they have in common:

Acommon<-A[A$id %in% commonID,]
Bcommon<-B[B$id %in% commonID,]
To find all IDs whose name differs, given a logical matrix diffs that is TRUE wherever Acommon and Bcommon disagree:

Acommon$id[diffs[,"name"]]
# [1] 100001

And so on.

This should get you started, but there may be more elegant solutions.
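
The snippet above indexes a logical matrix called diffs. A minimal sketch of one way to build such a matrix follows, assuming every id occurs at most once per data set and the compared columns contain no missing values. The toy data frames A and B mirror the question; the names vars, errors_per_variable, errors_per_record and only_in_B are illustrative and not part of the original answer.

```r
# Toy data mirroring the question; text columns kept as character, not factor.
A <- data.frame(
  id          = 100000:100001,
  name        = c("John Doe", "Jane Doe"),
  dob         = c("1/1/2000", "7/3/2011"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  vaccinename = c("MMR", "VARICELLA"),
  dose        = c(4L, 1L),
  stringsAsFactors = FALSE
)
B <- data.frame(
  id          = 100000:100002,
  name        = c("John Doe", "Jane Doee", "John Smith"),
  dob         = c("1/1/2000", "7/3/2011", "2/5/2010"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  vaccinename = c("MMR", "VARICELLA", "HEPB"),
  dose        = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

commonID <- intersect(A$id, B$id)

# Restrict to shared IDs and sort by id so that row i of Acommon
# and row i of Bcommon describe the same record.
Acommon <- A[A$id %in% commonID, ]
Acommon <- Acommon[order(Acommon$id), ]
Bcommon <- B[B$id %in% commonID, ]
Bcommon <- Bcommon[order(Bcommon$id), ]

# Comparing two equally sized data frames yields a logical matrix:
# TRUE marks a cell where the two data sets disagree.
vars  <- setdiff(names(A), "id")
diffs <- Acommon[vars] != Bcommon[vars]

ids_name_differs    <- Acommon$id[diffs[, "name"]]           # IDs whose name differs
errors_per_variable <- colSums(diffs)                        # frequency of each error "type"
errors_per_record   <- setNames(rowSums(diffs), Acommon$id)  # discrepancies per record
only_in_B           <- setdiff(B$id, A$id)                   # records missing from A
```

Here errors_per_variable gives the per-column error counts the question asks for, and only_in_B lists the extra records in B. With factor columns or missing values the comparison would need extra care, for example converting to character first.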

First, set up df1 and df2 so that others can reproduce this quickly:

df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))

df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
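We can use sapply to count the discrepancies in each column, where discrep is the result of mapply(setdiff, df1, df2):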

num.discrep <- sapply(discrep, length)
num.discrep
# id        name         dob vaccinedate vaccinename        dose 
# 0           1           0           1           0           1 
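
Because setdiff treats each column as a set of values, the counts above do not say which record a discrepancy belongs to. As a hedged complement, the sketch below pairs rows by id with merge and lists one row per differing cell. It assumes id is unique within each data frame; the data frames are rebuilt with character columns for simplicity, and the names both, vars and cell_diffs are illustrative.

```r
df1 <- data.frame(
  id          = 100000:100001,
  name        = c("John Doe", "Jane Doe"),
  dob         = c("1/1/2000", "7/3/2011"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  vaccinename = c("MMR", "VARICELLA"),
  dose        = c(4L, 1L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id          = 100000:100002,
  name        = c("John Doe", "Jane Doee", "John Smith"),
  dob         = c("1/1/2000", "7/3/2011", "2/5/2010"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  vaccinename = c("MMR", "VARICELLA", "HEPB"),
  dose        = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Inner join on id: columns from df1 get suffix ".A", columns from df2 ".B".
both <- merge(df1, df2, by = "id", suffixes = c(".A", ".B"))
vars <- setdiff(names(df1), "id")

# For each variable, keep the rows where the two versions disagree.
cell_diffs <- do.call(rbind, lapply(vars, function(v) {
  a   <- as.character(both[[paste0(v, ".A")]])
  b   <- as.character(both[[paste0(v, ".B")]])
  idx <- which(a != b)  # which() ignores comparisons that evaluate to NA
  data.frame(
    id       = both$id[idx],
    variable = rep(v, length(idx)),
    value_A  = a[idx],
    value_B  = b[idx],
    stringsAsFactors = FALSE
  )
}))

cell_diffs                  # one row per discrepant cell
table(cell_diffs$variable)  # frequency of each error "type"
```

The long format makes it easy to tabulate by variable or by id, at the cost of converting every value to character for the comparison.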

There is a newer package called waldo:

install.packages("waldo")
library(waldo)

# construct the data frames


df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))

df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))

# compare them
compare(df1,df2)

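Thanks. I didn't specify this in the example data frames above, but my actual data has multiple records per id. For example, John Doe might have 5 types of vaccine, and each vaccine might have multiple doses. In your first line of code, how would I determine which rows the two data sets have in common, rather than just basing it on id? I hope that makes sense.

There is no specific answer to that question. The problem is that if two rows are not identical, how do you decide whether they "should" be the same but contain a discrepancy, or whether they are actually completely different records? You will have to come up with some criteria to make that decision.

That's true. One of the data sets is the "gold standard" (from paper vaccination records), while the other was entered electronically and separately, so the first data set should be the "correct" one. Does that help clarify the question? The previous person who performed this audit reviewed the differences manually in Excel and found >1000 errors. Ideally, I'd like to avoid that manual work! :)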
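
When an id has several records, one possible criterion is to match on a composite key. The sketch below is only an illustration under stated assumptions: the key (id, vaccinename, dose) is a hypothetical choice, the data frames gold and entered and their contents are made up, and only vaccinedate is compared.

```r
# Hypothetical data: "gold" from paper records, "entered" keyed in separately.
gold <- data.frame(
  id          = c(1L, 1L, 2L),
  vaccinename = c("MMR", "MMR", "HEPB"),
  dose        = c(1L, 2L, 1L),
  vaccinedate = c("5/20/2012", "6/20/2013", "7/13/2013"),
  stringsAsFactors = FALSE
)
entered <- data.frame(
  id          = c(1L, 1L, 2L, 2L),
  vaccinename = c("MMR", "MMR", "HEPB", "HEPB"),
  dose        = c(1L, 2L, 1L, 2L),
  vaccinedate = c("5/20/2012", "6/02/2013", "7/13/2013", "9/01/2013"),
  stringsAsFactors = FALSE
)

# Assumed matching criterion: rows with the same key describe the same event.
key <- c("id", "vaccinename", "dose")

# Marker columns reveal which side each merged row came from.
gold$in_gold       <- TRUE
entered$in_entered <- TRUE

# Full outer join: unmatched rows are kept, with NA in the other side's columns.
m <- merge(gold, entered, by = key, all = TRUE,
           suffixes = c(".gold", ".entered"))

m$status <- ifelse(is.na(m$in_gold), "only in entered",
            ifelse(is.na(m$in_entered), "only in gold",
            ifelse(m$vaccinedate.gold != m$vaccinedate.entered,
                   "date differs", "match")))

table(m$status)
```

Which columns belong in the key is exactly the judgment call raised in the comments: a field that is itself mistyped cannot serve as part of the key without turning one discrepancy into two unmatched rows.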
The exact discrepancies in each column can be obtained with setdiff and mapply:

discrep <- mapply(setdiff, df1, df2)
discrep
# $id
# integer(0)
# 
# $name
# [1] "Jane Doe"
# 
# $dob
# character(0)
# 
# $vaccinedate
# [1] "3/14/2013"
# 
# $vaccinename
# character(0)
# 
# $dose
# [1] 4
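
# The same per-column count using purrr
library(purrr)  # provides map2(), map_int() and the %>% pipe
map2(df1, df2, setdiff) %>%
  map_int(length)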
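
The compareDF package is another option; dataframe1, dataframe2 and "columnname" are placeholders for the two data frames and the grouping (key) column:

library(compareDF)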

compare_df(dataframe1, dataframe2, c("columnname"))
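
Sample output from waldo's compare(), shown here for two data frames old and new with columns X, Y and Z rather than for df1 and df2:

`old` is length 2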
`new` is length 3

`names(old)`: "X" "Y"    
`names(new)`: "X" "Y" "Z"

`attr(old, 'row.names')`: 1 2 3  
`attr(new, 'row.names')`: 1 2 3 4

`old$X`: 1 2 3  
`new$X`: 1 2 3 4

`old$Y`: "a" "b" "c"    
`new$Y`: "A" "b" "c" "d"

`old$Z` is absent
`new$Z` is a character vector ('k', 'l', 'm', 'n')