Finding the most common match across a set of variables, by key, in a data.table in R

r, dataframe, data.table, apply

I am trying to find the most common occurrence across a set of variables, by key, in a data.table in R. Here is a small example of what I am trying to do:
library(data.table)
mydata <- data.table(mergedName=c("JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE","JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE"),
job=c("teacher","teacher","teacher","teacher","teacher","teacher","police","police","police","police","police","police"),
from=c("NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG"),
misspelled_NYT=c("John Doe", NA, NA, "Mary White", NA, NA,"John_Doe", NA, NA, "Mary*White", NA, NA),
misspelled_USAT=c(NA, "JohnDOE", NA, NA, "Mary White", NA, NA, "John Doe", NA, NA, "Mary White", NA),
misspelled_BG=c(NA, NA, "John Doe", NA, NA, "Mary-White", NA, NA, "John Doe", NA, NA, "Mary White"))
setkeyv(mydata, cols=c("mergedName","job"))
Here is the output I expect (for each keyed combination of mergedName & job, the most common spelling of the name across the three sources):
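I have been able to do this with a data frame in wide format. Below is a small example of the code for the wide form. Note: for some reason this only seems to work on larger data frames and does not work on the example below, even though the code is identical. The output of table() applied across the rows of this data frame is not what I expected: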
mydataWide <- data.frame(mergedName=c("JOHNDOE","MARYWHITE","JOHNDOE","MARYWHITE"),
job=c("teacher","police","teacher","police"),
misspelled_NYT=c("John Doe", "Mary White", "John_Doe", "Mary*White"),
misspelled_USAT=c("JohnDOE", "Mary White", "John Doe", "Mary White"),
misspelled_BG=c("John Doe", "Mary-White", "John Doe", "Mary White"),
stringsAsFactors=FALSE)
nametable <- apply(mydataWide[,paste("misspelled", c("NYT","USAT","BG"), sep="_")], 1, function(x) sort(table(x), decreasing=TRUE))
mydataWide$actualSpelling <- names(sapply(nametable,`[`, 1) )
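mydataWide

You can first melt mydata into long form and drop the NA rows (the code below does this with na.rm=TRUE inside melt rather than a separate na.omit step). Then use which.max and table to find the maximum count of actualSpelling, grouped by mergedName and job. which.max returns the numeric index of the highest count, and the name at that index is the term with the maximum frequency.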
library(data.table)
melt(mydata, id.vars=c('mergedName', 'job'), measure.vars=4:6,
na.rm=TRUE, value.name='actualSpelling')[,
actualSpelling:= names(which.max(table(actualSpelling))),
by=list(mergedName, job)][order(mergedName), -3]
# mergedName job actualSpelling
#1: JOHNDOE police John Doe
#2: JOHNDOE teacher John Doe
#3: JOHNDOE police John Doe
#4: JOHNDOE teacher John Doe
#5: JOHNDOE police John Doe
#6: JOHNDOE teacher John Doe
#7: MARYWHITE police Mary White
#8: MARYWHITE teacher Mary White
#9: MARYWHITE police Mary White
#10: MARYWHITE teacher Mary White
#11: MARYWHITE police Mary White
#12: MARYWHITE teacher Mary White
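The result above keeps one row for every non-missing spelling. If a single row per mergedName/job combination is wanted, as the question describes, the following is a minimal sketch of one way to get it. It assumes the same example data as in the question (rebuilt here so the block runs on its own); the names long and mostCommon are only illustrative.

```r
library(data.table)

# Same example data as in the question, rebuilt so this block is self-contained.
mydata <- data.table(
  mergedName = rep(rep(c("JOHNDOE", "MARYWHITE"), each = 3), 2),
  job = rep(c("teacher", "police"), each = 6),
  from = rep(c("NYT", "USAT", "BG"), 4),
  misspelled_NYT = c("John Doe", NA, NA, "Mary White", NA, NA,
                     "John_Doe", NA, NA, "Mary*White", NA, NA),
  misspelled_USAT = c(NA, "JohnDOE", NA, NA, "Mary White", NA,
                      NA, "John Doe", NA, NA, "Mary White", NA),
  misspelled_BG = c(NA, NA, "John Doe", NA, NA, "Mary-White",
                    NA, NA, "John Doe", NA, NA, "Mary White")
)

# Long form: one row per non-missing spelling.
long <- melt(mydata, id.vars = c("mergedName", "job"),
             measure.vars = c("misspelled_NYT", "misspelled_USAT", "misspelled_BG"),
             na.rm = TRUE, value.name = "actualSpelling")

# Aggregating in j (instead of assigning with :=) returns one row per group.
# which.max() keeps the first maximum, so a tie goes to whichever spelling
# comes first in the sorted table() output.
mostCommon <- long[, .(actualSpelling = names(which.max(table(actualSpelling)))),
                   keyby = .(mergedName, job)]
mostCommon
```

On the example data this gives four rows: John Doe for both JOHNDOE groups and Mary White for both MARYWHITE groups. Using keyby rather than by also sorts the result by the grouping columns.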
In the syntax of melt you can do: melt(mydata, id.vars=1:2, measure.vars=4:6, na.rm=TRUE), although using column names would of course be safer. Then there is no need for subset or na.omit.
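As a sketch of the "use column names" suggestion, the measure columns can be picked by a name pattern instead of by position. The table dt below is a small hypothetical stand-in with the same column layout as the question's data, and spellingCols is an illustrative name.

```r
library(data.table)

# Hypothetical stand-in data with the same column layout as the question.
dt <- data.table(
  mergedName = c("JOHNDOE", "JOHNDOE", "JOHNDOE"),
  job = "teacher",
  from = c("NYT", "USAT", "BG"),
  misspelled_NYT = c("John Doe", NA, NA),
  misspelled_USAT = c(NA, "JohnDOE", NA),
  misspelled_BG = c(NA, NA, "John Doe")
)

# Select the measure columns by name pattern rather than by position 4:6,
# so the call keeps working if columns are added or reordered.
spellingCols <- grep("^misspelled_", names(dt), value = TRUE)

# na.rm = TRUE drops the NA rows during the melt itself.
long <- melt(dt, id.vars = c("mergedName", "job"),
             measure.vars = spellingCols, na.rm = TRUE,
             value.name = "actualSpelling")
long
```

Selecting by pattern relies on the misspelled_ prefix being used consistently; an explicit character vector of column names works equally well.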