R: of two matching rows, eliminate the one with more NAs (r, dataframe, data-cleaning, data-munging)


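I'm looking for a way to check whether two columns in a data frame contain the same elements for one or more rows, and then remove the row that contains more NAs.

Suppose we have a data frame like this: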

x <- data.frame("Year" = c(2017,2017,2017,2018,2018),
            "Country" = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
            "Sales" = c(15, 15, 18, 13, 12),
            "Campaigns" = c(3, NA, 4, 1, 1),
            "Employees" = c(15, 15, 12, 8, 9),
            "Satisfaction" = c(0.8, NA, 0.9, 0.95, 0.87),
            "Expenses" = c(NA, NA, 9000, 7500, 4300))
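
To make the goal concrete, here is a minimal base R sketch that counts the NAs in each row and flags the rows sharing a Year/Country pair. The names `n_na` and `dup_key` are illustrative only and do not appear in the original question.

```r
x <- data.frame("Year" = c(2017, 2017, 2017, 2018, 2018),
                "Country" = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                "Sales" = c(15, 15, 18, 13, 12),
                "Campaigns" = c(3, NA, 4, 1, 1),
                "Employees" = c(15, 15, 12, 8, 9),
                "Satisfaction" = c(0.8, NA, 0.9, 0.95, 0.87),
                "Expenses" = c(NA, NA, 9000, 7500, 4300))

# Number of missing values in each row: 1, 3, 0, 0, 0
n_na <- rowSums(is.na(x))

# TRUE for every row whose (Year, Country) pair occurs more than once
key <- x[c("Year", "Country")]
dup_key <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rows 1 and 2 (both 2017 / Sweden) compete; row 2 has more NAs
cbind(x[c("Year", "Country")], n_na, dup_key)
```

Here rows 1 and 2 form the only duplicated pair, so the task reduces to dropping row 2.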

We can use a data.table approach:
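library(data.table)
# setDT() converts x to a data.table by reference
ind <- setDT(x)[, {
    # NAs per row within the group (.SD excludes the grouping columns)
    i1 <- Reduce(`+`, lapply(.SD, is.na))
    # global row indices (.I) of rows that have NAs and the most NAs in the group
    .I[i1 > 0 & (i1 == max(i1))]
    }, .(Year, Country)]$V1
# Note: a row with NAs that is alone in its group is flagged as well
x[-ind]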
#    Year Country Sales Campaigns Employees Satisfaction Expenses
#1: 2017  Sweden    15         3        15         0.80       NA
#2: 2017  Norway    18         4        12         0.90     9000
#3: 2018 Denmark    13         1         8         0.95     7500
#4: 2018 Finland    12         1         9         0.87     4300
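
As a variation on this answer, one could select the rows to keep rather than the rows to drop, which also leaves single-row groups untouched. This is only a sketch under that assumption, not the answerer's method; `dt`, `keep` and `res` are hypothetical names.

```r
library(data.table)

dt <- data.table(Year = c(2017, 2017, 2017, 2018, 2018),
                 Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                 Sales = c(15, 15, 18, 13, 12),
                 Campaigns = c(3, NA, 4, 1, 1),
                 Employees = c(15, 15, 12, 8, 9),
                 Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                 Expenses = c(NA, NA, 9000, 7500, 4300))

# For each (Year, Country) group, take the global row index (.I) of the row
# with the fewest NAs; which.min() returns the first minimum, so ties
# resolve to the earliest row in the group.
keep <- dt[, .I[which.min(rowSums(is.na(.SD)))], by = .(Year, Country)]$V1

# sort() restores the original row order
res <- dt[sort(keep)]
res
```

Because every group contributes exactly one index, `keep` is never empty for a non-empty table, which avoids negative indexing with an empty vector.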

Using dplyr:
library(dplyr)
x %>%
  mutate(n_na = rowSums(is.na(.))) %>%  ## calculate NAs for each row      
  group_by(Year, Country) %>%           ## for each year/country
  arrange(n_na) %>%                       ## sort by number of NAs
  slice(1) %>%                            ## take the first row
  select(-n_na)                           ## remove the NA counter column
# A tibble: 4 x 7
# Groups:   Year, Country [4]
   Year Country Sales Campaigns Employees Satisfaction Expenses
  <dbl>  <fctr> <dbl>     <dbl>     <dbl>        <dbl>    <dbl>
1  2017  Norway    18         4        12         0.90     9000
2  2017  Sweden    15         3        15         0.80       NA
3  2018 Denmark    13         1         8         0.95     7500
4  2018 Finland    12         1         9         0.87     4300
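
With dplyr 1.0.0 or later, the arrange/slice pair can likely be written with `slice_min()`. The sketch below assumes that version is available; `res` is an illustrative name, and `ungroup()` is added so that the result is a plain tibble.

```r
library(dplyr)

x <- data.frame(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

res <- x %>%
  mutate(n_na = rowSums(is.na(.))) %>%             # NAs per row
  group_by(Year, Country) %>%                      # one group per year/country
  slice_min(n_na, n = 1, with_ties = FALSE) %>%    # keep the row with the fewest NAs
  ungroup() %>%
  select(-n_na)                                    # drop the helper column
res
```

Setting `with_ties = FALSE` guarantees exactly one row per group even when two rows have the same number of NAs.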

A base R solution:
x$nas <- rowSums(sapply(x, is.na))
do.call(rbind,
        by(x, x[c("Year","Country")],
           function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
#   Year Country Sales Campaigns Employees Satisfaction Expenses nas
# 4 2018 Denmark    13         1         8         0.95     7500   0
# 5 2018 Finland    12         1         9         0.87     4300   0
# 3 2017  Norway    18         4        12         0.90     9000   0
# 1 2017  Sweden    15         3        15         0.80       NA   1
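
Another base R possibility, offered as a sketch rather than as part of the answer above, is to sort by the NA count and then drop later duplicates of the Year/Country key with `duplicated()`. The names `x_sorted` and `res` are hypothetical.

```r
x <- data.frame(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

# Sort so that rows with fewer NAs come first (order() is stable for ties)
x_sorted <- x[order(rowSums(is.na(x))), ]

# Keep only the first occurrence of each (Year, Country) pair,
# which after sorting is the row with the fewest NAs
res <- x_sorted[!duplicated(x_sorted[c("Year", "Country")]), ]
res
```

This avoids adding a helper column to `x`, at the cost of returning the rows in NA-count order instead of the original order.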

Could you add your desired output?

Friedemann, if one of the answers meets your needs, please "accept" it (only one) by selecting the check mark to its left (optionally upvoting any or all that you find useful).

Sorry for the late reply. I tested the solutions and they all worked for me. I have accepted the dplyr solution because it seems the most elegant to me, but that is purely subjective.