R: Removing duplicate records that are not exact duplicates
I have a list of records that needs de-duplication. The records look like combinations of the same set of values, but the usual de-duplication functions do not work because the two columns are not duplicated as they stand: the same pair can appear with its values swapped between the columns. Here is a reproducible example:
df <- data.frame( A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
This is the output I want:
df_Final <- data.frame( A = c("2","2","2","331","391","481"),
B = c("43","501","502","491","496","490"))
You can remove all rows that become duplicates once the values within each row are sorted:
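library(dplyr)
df %>%
  apply(1, sort) %>% t() %>%  # sort within each row so (a, b) and (b, a) match
  data.frame() %>%
  group_by_all() %>%          # group identical sorted pairs
  slice(1)                    # keep one row per group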
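For comparison, the same idea can be sketched in base R without dplyr. This is only an illustration: `key` and `res` are names made up here, and `stringsAsFactors = FALSE` is an assumption added so the columns are character vectors on older R versions too. Unlike the pipeline above, it leaves the values of each kept row in their original columns. On the example data it keeps 9 rows, because pairs such as 43/501 are distinct unordered pairs, so it retains more than the 6 rows of the desired output.

```r
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)

# Build an order-independent key: smaller value first, larger second,
# so that ("2", "43") and ("43", "2") map to the same string.
key <- paste(pmin(df$A, df$B), pmax(df$A, df$B))

# Keep the first row seen for each unordered pair.
res <- df[!duplicated(key), ]
nrow(res)
```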
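I think you want to know where each element of column A first appears in column B. Keep a row if its A value does not occur in B at all (is.na(idx)), or if its A value first occurs in B at a later row than the current one (seq_along(idx) < idx):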
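library(tidyverse)
df %>%
  mutate(idx = match(A, B)) %>%                    # row where each A value first appears in B
  filter(is.na(idx) | seq_along(idx) < idx) %>%    # never in B, or only later in B
  select(-idx)                                     # drop the helper column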
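To see why this filter reproduces the desired output, it can help to look at `idx` next to the row numbers. The sketch below uses base R only; `row`, `keep`, and `res` are illustrative names, and `stringsAsFactors = FALSE` is an assumption added so the comparison also works on older R versions.

```r
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)
df_Final <- data.frame(
  A = c("2","2","2","331","391","481"),
  B = c("43","501","502","491","496","490"),
  stringsAsFactors = FALSE
)

idx <- match(df$A, df$B)  # row of the first occurrence of each A value in B, NA if absent
row <- seq_along(idx)

# A = "2" first appears in B at row 4, so rows 1 to 3 (row < idx) are kept.
# A = "43" first appears in B at row 1, so rows 4 to 6 are dropped.
keep <- is.na(idx) | row < idx
res <- df[keep, ]

# Compare element by element with the desired output.
all(res == df_Final)
```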
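There is no obvious connection between your input and the expected output. For example, what happens to the entries with A = 43? It is clear that you want to eliminate duplicates, but the logic behind it is not intuitive and cannot easily be inferred from the data. If there is no well-defined rule, you could go through the input row by row and explain why each row is kept or discarded. What rule decides which vector keeps a value? Why does 2 belong to A and 43 to B?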
The same filter in base R:
idx <- match(df$A, df$B)
df[is.na(idx) | seq_along(idx) < idx, ]