Finding the most common match across a set of variables, by key, in a data.table in R

Tags: r, dataframe, data.table, apply

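I am trying to find the most common occurrence across a set of variables, by key, in a data.table in R. Here is a small example of what I am trying to do: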

library(data.table)

mydata <- data.table(mergedName=c("JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE","JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE"),
                     job=c("teacher","teacher","teacher","teacher","teacher","teacher","police","police","police","police","police","police"),
                     from=c("NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG"),
                     misspelled_NYT=c("John Doe", NA, NA, "Mary White", NA, NA,"John_Doe", NA, NA, "Mary*White", NA, NA),
                     misspelled_USAT=c(NA, "JohnDOE", NA, NA, "Mary White", NA, NA, "John Doe", NA, NA, "Mary White", NA),
                     misspelled_BG=c(NA, NA, "John Doe", NA, NA, "Mary-White", NA, NA, "John Doe", NA, NA, "Mary White"))

setkeyv(mydata, cols=c("mergedName","job"))
Here is my desired output (for each keyed combination of mergedName & job, the most common spelling of the name across the three sources):

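I have been able to do this with a data frame in wide format. Below is a small example of the code for doing it in wide form. Note: for some reason this only seems to work on larger data frames and does not work on the example below, even though the code is the same. The output of table() applied across the rows of this DF is not what I expected: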

mydataWide <- data.frame(mergedName=c("JOHNDOE","MARYWHITE","JOHNDOE","MARYWHITE"),
                         job=c("teacher","police","teacher","police"),
                         misspelled_NYT=c("John Doe", "Mary White", "John_Doe", "Mary*White"),
                         misspelled_USAT=c("JohnDOE", "Mary White", "John Doe", "Mary White"),
                         misspelled_BG=c("John Doe", "Mary-White", "John Doe", "Mary White"),
                         stringsAsFactors=FALSE)

nametable <- apply(mydataWide[,paste("misspelled", c("NYT","USAT","BG"), sep="_")], 1, function(x) sort(table(x), decreasing=TRUE))
mydataWide$actualSpelling <- names(sapply(nametable,`[`, 1) )

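mydataWide

You can first melt mydata into long form, dropping the NA rows (here with na.rm=TRUE in melt; na.omit on the melted result would also work). Then, grouped by mergedName and job, use which.max on table to find the position of the max count of actualSpelling. Taking names() of that numeric index gives the term with the highest frequency.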

 library(data.table)
 melt(mydata, id.vars=c('mergedName', 'job'), measure.vars=4:6,
    na.rm=TRUE, value.name='actualSpelling')[,
      actualSpelling:= names(which.max(table(actualSpelling))), 
      by=list(mergedName, job)][order(mergedName), -3]


 #   mergedName     job actualSpelling
 #1:    JOHNDOE  police       John Doe
 #2:    JOHNDOE teacher       John Doe
 #3:    JOHNDOE  police       John Doe
 #4:    JOHNDOE teacher       John Doe
 #5:    JOHNDOE  police       John Doe
 #6:    JOHNDOE teacher       John Doe
 #7:  MARYWHITE  police     Mary White
 #8:  MARYWHITE teacher     Mary White
 #9:  MARYWHITE  police     Mary White
#10:  MARYWHITE teacher     Mary White
#11:  MARYWHITE  police     Mary White
#12:  MARYWHITE teacher     Mary White
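
As for why the wide-format attempt in the question misbehaves on the small example, a likely reason is that apply() simplifies its result when every call returns a vector of the same length. In the small example each row happens to contain exactly two distinct spellings, so the four tables collapse into a 2 x 4 matrix of counts and the spellings are no longer available as names. On larger data the rows differ in their number of distinct spellings, apply() returns a list, and the original code works. The sketch below reproduces this and shows one way around it; most_common() is a hypothetical helper, and the code assumes every row has at least one non-NA spelling.

```r
# Same wide-format example as in the question.
mydataWide <- data.frame(
  mergedName = c("JOHNDOE", "MARYWHITE", "JOHNDOE", "MARYWHITE"),
  job = c("teacher", "police", "teacher", "police"),
  misspelled_NYT = c("John Doe", "Mary White", "John_Doe", "Mary*White"),
  misspelled_USAT = c("JohnDOE", "Mary White", "John Doe", "Mary White"),
  misspelled_BG = c("John Doe", "Mary-White", "John Doe", "Mary White"),
  stringsAsFactors = FALSE
)

spell_cols <- paste("misspelled", c("NYT", "USAT", "BG"), sep = "_")

# Each row has exactly two distinct spellings, so apply() simplifies the
# four equal-length tables into a matrix instead of returning a list.
nametable <- apply(mydataWide[, spell_cols], 1,
                   function(x) sort(table(x), decreasing = TRUE))
is.matrix(nametable)

# Hypothetical helper: most frequent value of x (NA values are ignored by
# table(); ties go to the first name in table()'s sorted order).
most_common <- function(x) names(which.max(table(x)))

# vapply() over row indices always yields one string per row, regardless
# of how many distinct spellings each row contains.
mydataWide$actualSpelling <- vapply(
  seq_len(nrow(mydataWide)),
  function(i) most_common(unlist(mydataWide[i, spell_cols], use.names = FALSE)),
  character(1)
)

mydataWide[, c("mergedName", "job", "actualSpelling")]
```

Working row by row like this is slower than the melt approach on large data, so it is mainly useful for seeing what went wrong.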

In melt syntax you can do: melt(mydata, id.vars=1:2, measure.vars=4:6, na.rm=TRUE). Of course, using column names would be safer. Then there is no need for subset or na.omit.
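
The suggestion to use column names can be sketched as follows. This is only an illustration, assuming the spelling columns can be recognised by the misspelled_ prefix. It also collapses the result to one row per (mergedName, job) combination, which may or may not be the shape wanted; spell_cols, long and result are names introduced here.

```r
library(data.table)

# Same data as in the question, written more compactly.
mydata <- data.table(
  mergedName = rep(rep(c("JOHNDOE", "MARYWHITE"), each = 3), 2),
  job = rep(c("teacher", "police"), each = 6),
  from = rep(c("NYT", "USAT", "BG"), 4),
  misspelled_NYT = c("John Doe", NA, NA, "Mary White", NA, NA,
                     "John_Doe", NA, NA, "Mary*White", NA, NA),
  misspelled_USAT = c(NA, "JohnDOE", NA, NA, "Mary White", NA,
                      NA, "John Doe", NA, NA, "Mary White", NA),
  misspelled_BG = c(NA, NA, "John Doe", NA, NA, "Mary-White",
                    NA, NA, "John Doe", NA, NA, "Mary White")
)

# Select the measure columns by name rather than by position (4:6).
spell_cols <- grep("^misspelled_", names(mydata), value = TRUE)

# Long form; na.rm = TRUE drops the NA spellings during the melt.
long <- melt(mydata,
             id.vars = c("mergedName", "job"),
             measure.vars = spell_cols,
             na.rm = TRUE,
             value.name = "actualSpelling")

# One row per key combination, holding the most frequent spelling.
# Ties go to the first name in table()'s sorted order.
result <- long[, .(actualSpelling = names(which.max(table(actualSpelling)))),
               by = .(mergedName, job)]
setorder(result, mergedName, job)
result
```

Selecting by name keeps the call working if columns are later added or reordered, which positional indices such as 4:6 would not.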