R: Create & assign duplicate records
Tags: r, duplicates, uniqueidentifier

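I have a series of media sources to which I have to assign county names. For sources that map to a single county (e.g., local newspapers), this is fairly straightforward: I created a county name variable with a switch that assigns the county name based on the source name. Sample:

switchfun <- function(x) {
  switch(x, 'Morning Call' = 'Lehigh', 'Inquirer' = 'Philadelphia',
         'Daily Ledger' = 'Mercer', 'Null')
}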

County.Name <- as.character(lapply(Source, switchfun))
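As an aside, the same mapping can be written as a lookup in a named character vector, which avoids calling a function once per row. This is only a minimal sketch: `county_lookup`, `sources`, and `county_name` are hypothetical names used for illustration, and the input values are made up.

```r
# Hypothetical lookup table: names are sources, values are counties
county_lookup <- c("Morning Call" = "Lehigh",
                   "Inquirer"     = "Philadelphia",
                   "Daily Ledger" = "Mercer")

# Illustrative input; as.character() guards against factor input
sources <- factor(c("Morning Call", "NPR", "Inquirer"))

# Index the lookup by name; sources without an entry come back as NA
county_name <- unname(county_lookup[as.character(sources)])

# Mirror the default branch of the switch version
county_name[is.na(county_name)] <- "Null"
```

Leaving the unmatched entries as NA instead of "Null" may be more convenient if later steps rely on is.na() to find the national sources.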

In the current file, NPR, Associated Press, and Yahoo News have no associated county ("NA").

dput of the current file:

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 1L, 6L
), .Label = c("Associated Press", "Daily Ledger", "Herald Tribune", 
"Inquirer", "Morning Call", "NPR", "Yahoo News"), class = "factor"), 
County = structure(c(1L, 2L, 4L, 3L, NA, NA, NA), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), 
Score = c(3L, 10L, 4L, 8L, 1L, 3L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -7L
))

dput of the desired file layout:

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 7L, 7L, 
7L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L), .Label = c("Associated Press", 
"Daily Ledger", "Herald Tribune", "Inquirer", "Morning Call", 
"NPR", "Yahoo News"), class = "factor"), County = structure(c(1L, 
2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), Score = c(3L, 
10L, 4L, 8L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -16L
))

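In the desired layout, each national source and its score are assigned to each of the four counties in the data set. E.g., Yahoo News and its score of 1 are repeated 4 times, once each for Lehigh, Philadelphia, Montgomery, and Mercer counties, and the record where Yahoo News has an "NA" county disappears. In my actual data set I have about 100 counties, so Yahoo News and its related variables (e.g., score, date, author, etc.; I have about 60 variables in total) will be replicated 100 times. I also want the counties of these new "replicated" records to go into the County.Name variable that I created with the switch function above. I don't need 2 county name fields; I need all of the newly created counties to be under County.Name.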

If I understand correctly, this might be one possibility:

# a (minimal) data frame with all unique source-county combinations
src_cnt <- data.frame(source = c("Morning Call", "AP", "AP", "AP"), county = c("Lehigh", "Lehigh", "Mercer", "Phila"))

# a data frame with a unique score for each source
src_score <- data.frame(source = c("Morning Call", "AP"), score = c(10, 3))

merge(src_cnt, src_score)
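To make the effect of that merge easier to check, here is the same toy example with the join key spelled out and the result stored. It is a small sketch only; `res` is a name introduced here for illustration.

```r
# Same toy data as in the answer above
src_cnt <- data.frame(source = c("Morning Call", "AP", "AP", "AP"),
                      county = c("Lehigh", "Lehigh", "Mercer", "Phila"))
src_score <- data.frame(source = c("Morning Call", "AP"), score = c(10, 3))

# Same join, with the key column named explicitly
res <- merge(src_cnt, src_score, by = "source")

# One row per source-county pair; each row carries its source's score
nrow(res)
res$score[res$source == "AP"]
```

Because the score table has one row per source, every source-county row picks up that source's single score, which is the replication the question asks for.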

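It would be great if you could provide some sample data and show the desired result. I think you may be looking for merge, but without a better representation of the data it is hard to say.

Sorry, it was late and I was tired. Updated with more explanation and dput output for reproducibility.

I actually have a unique ID that also needs to be accounted for, so I modified it to: src_cnt
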
# Assuming your current data is named dd
# select the national sources, i.e. the sources where County is missing
src_national <- dd$Source[is.na(dd$County)]

# select unique counties
counties <- unique(dd$County[!is.na(dd$County)])

# create all combinations of national sources and counties
src_cnt <- expand.grid(Source = src_national, County = counties)

# add score from current data to national sources
src_cnt2 <- merge(src_cnt, dd[is.na(dd$County), c("Source", "Score")], by = "Source")

# add national sources to local sources in dd
dd2 <- rbind(dd[!is.na(dd$County), ], src_cnt2)

# order by Source and County
# assuming desired data is named `desired`
library(plyr)
desired2 <- arrange(df = desired, Source, County) 
dd2 <- arrange(df = dd2, Source, County)
all.equal(desired2, dd2)
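
Since the real data has about 60 variables, it may be easier to carry every column along rather than merging Score back in by name. The following is a base-R sketch under stated assumptions: the data frame is rebuilt from the question's dput with character columns instead of factors, and `local`, `national`, `national_x`, and `counties` are hypothetical names. It relies on merge() returning the Cartesian product when `by` has length zero.

```r
# Current data, rebuilt from the question's dput (characters, not factors)
dd <- data.frame(
  Source = c("Morning Call", "Daily Ledger", "Inquirer", "Herald Tribune",
             "Yahoo News", "Associated Press", "NPR"),
  County = c("Lehigh", "Mercer", "Philadelphia", "Montgomery", NA, NA, NA),
  Score  = c(3L, 10L, 4L, 8L, 1L, 3L, 6L),
  stringsAsFactors = FALSE
)

# Local sources keep their single county
local <- dd[!is.na(dd$County), ]

# National sources: every column except the empty County
national <- dd[is.na(dd$County), names(dd) != "County", drop = FALSE]

# One row per distinct county
counties <- data.frame(County = unique(local$County), stringsAsFactors = FALSE)

# by = NULL requests the Cartesian product: each national row x each county
national_x <- merge(national, counties, by = NULL)

# Restore the original column order, stack, and sort
dd2 <- rbind(local, national_x[, names(dd)])
dd2 <- dd2[order(dd2$Source, dd2$County), ]
rownames(dd2) <- NULL
```

With this shape, extra columns such as a unique ID, date, or author would be replicated along with Score without having to be listed.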