R: Cleaning origin and destination data with duplicated but different factor levels
I have some GIS data with origins and destinations (ODs), plus information on the time of day at which each OD operates. I plan to make a map and colour the ODs by time of day. One issue is that some ODs appear in the dataset for both day and night, possibly with origin and destination in a different order. I want to label these differently, e.g. Day/Night.
library(data.table)
Origin<-c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination<-c("Paris", "London", "Berlin","Lisbon", "Lisbon", "Paris")
Time=factor(c("Day", "Night", "Day", "Day/Night","Day", "Day/Night"))
dt<-data.table(Origin=Origin, Destination=Destination, Time=Time)
#duplicates regardless of order
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast=TRUE),]
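Is there a simple way to do this? My MWE contains only one such OD, but I need to identify it among several other ODs. I can find the duplicates regardless of order, but I don't know how to check whether both time cases are present, or how to replace them with Day/Night.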
You can do this with the dplyr package, as shown below. Feel free to change the conditions to suit your needs.
library(data.table)
library(dplyr)
# Creating data
dt <-
  data.table(
    Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
    Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
    Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
  )
dt
# Origin Destination Time
# London Paris Day
# Paris London Night
# Italy Norway Day
# Spain Portugal <NA>
# Portugal Spain <NA>
# Poland Spain <NA>
dt %>%
  # pmin and pmax sort the two columns row-wise,
  # so that we can group by them regardless of their order
  group_by(Origin2 = pmin(Origin, Destination),
           Destination2 = pmax(Origin, Destination)) %>%
  mutate(count = n(),        # number of rows for this origin/destination pair
         row = row_number(), # position within the pair: first or second occurrence
         # If not repeated, then Time = Day
         # If repeated and first occurrence, then Time = Day
         # If repeated and second occurrence, then Time = Night
         Time = case_when(count == 1 ~ "Day",
                          count == 2 & row == 1 ~ "Day",
                          count == 2 & row == 2 ~ "Night")) %>%
  ungroup() %>%
  select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
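Thanks to @Nareman Darwisch for the dplyr solution, which inspired me to work towards a data.table solution. I am creating a new variable that serves as a unique ID for each origin-destination pair: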
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt.temp<-data.table(dat.sort)
dt.temp[,unique.name:=paste(V1,V2)]
dt$unique.name<-factor(dt.temp$unique.name)
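What I would like to understand is how to use a logical condition in the spirit of looking at the levels by group and comparing them with the cases I want: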
dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]
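But I guess the levels() call always gives me all three levels.

If I understand correctly, the OP wants to
1. identify city pairs regardless of the order of origin and destination, e.g. London-Paris is considered the same city pair as Paris-London,
2. collapse separate rows if a city pair is operated Day and Night or Day/Night,
3. or update the original dataset.
This is what I would do: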
library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]
The key point is to identify the city pairs that fulfil the second requirement:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
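So there is no need to deal with factor levels. By the way, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What changes is which of the levels are actually used in the subset or group.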
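The point about factor levels can be checked with a small base R sketch. It reuses the Time column from the question; the variable name `sub` and the choice of rows 3 and 5 (the two Berlin-Lisbon rows) are only for illustration.

```r
# Time column as defined in the question
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))

# Subset: rows 3 and 5, which are both "Day"
sub <- Time[c(3, 5)]

levels(sub)                 # still all three levels of the full column
levels(droplevels(sub))     # only the level that is actually used: "Day"
as.character(unique(sub))   # the values present, without touching the levels

# Testing the values instead of the levels gives the intended answer
all(c("Day", "Night") %in% sub)   # FALSE: this subset has no "Night"
```

This is why a condition written with `levels(Time)` inside `by =` evaluates the same way for every group, while one written with `Time` itself does not.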
pairs_DN contains the unique keys of these city pairs.
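The code above updates the original rows in place. For the alternative of collapsing to one row per city pair, a minimal sketch could look like the following. It rebuilds the question's data so that it runs on its own; the name `collapsed` is hypothetical, and the sketch assumes that a pair which does not run at both times has only one distinct Time value.

```r
library(data.table)

# Data from the question
dt <- data.table(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
)

# Order-independent key for each city pair
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")]

# One row per pair: "Day/Night" if the pair runs at both times,
# otherwise the (assumed single) time of day recorded for it
collapsed <- dt[, {
  tm   <- as.character(Time)
  both <- all(c("Day", "Night") %in% tm) || "Day/Night" %in% tm
  .(Time = if (both) "Day/Night" else tm[1L])
}, by = Pair]

collapsed
```

With the example data this yields three rows, one each for London-Paris, Berlin-Lisbon and Lisbon-Madrid. Whether to collapse or to update in place depends on whether the map needs the direction of travel, which the collapsed table no longer carries.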
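So, are you trying to find out whether there are cases where the same two countries have two records for Day?
I am trying to find records for the same origin/destination that operate both Day and Night, and I want to recode them as Day/Night.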
Output of the grouped expression:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
            Pair    V1
1:  London-Paris  TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid  TRUE
Contents of pairs_DN:
            Pair
1:  London-Paris
2: Lisbon-Madrid