R: Cleaning origin and destination data with duplicated but different factor levels


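I have some GIS data with origins and destinations (OD), plus information about the time of day at which each OD is operated. I intend to make a map and colour the ODs according to the time-of-day information.

One issue is that some ODs appear in the dataset with both Day and Night, possibly with origin and destination in a different order. I would like to label these differently, e.g. as Day/Night.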

library(data.table)
Origin <- c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination <- c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris")
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
dt <- data.table(Origin = Origin, Destination = Destination, Time = Time)
# duplicates regardless of order
dat.sort <- t(apply(dt[, .(Origin, Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast = TRUE), ]

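Is there a simple way to do this? My MWE contains only one such OD, but I need to identify it among several other ODs. I can find the duplicates regardless of order, but I do not know how to check whether both time cases are present for a pair, or how to replace them with Day/Night.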


You can do this with the dplyr package, as shown below.

Feel free to change the conditions to suit your needs.

library(data.table)
library(dplyr)

# Creating data
dt <- 
  data.table(
    Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
    Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
    Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
  )

dt

# Origin Destination  Time
# London   Paris      Day
# Paris    London     Night
# Italy    Norway     Day
# Spain    Portugal   <NA>
# Portugal Spain      <NA>
# Poland   Spain      <NA>

dt %>%
  # pmin and pmax are used to sort the 2 columns
  # in order to group by them regardless of their order
  group_by(Origin2 = pmin(Origin, Destination), 
           Destination2 = pmax(Origin, Destination)) %>%
  mutate(count = n(), # To check if Origin/Destination are repeated or not
         row = row_number(), # Placeholder to know if it was the first or the second to repeat
         # If not repeated then make Time = Day
         # If repeated and first occurrence then Time = Day
         # If repeated and second occurrence then Time = Night
         Time = case_when(count == 1 ~ "Day",
                          count == 2 & row == 1 ~ "Day",
                          count == 2 & row == 2 ~ "Night")) %>%
  ungroup() %>%
  select(Origin, Destination, Time)

#   Origin   Destination Time
#   <chr>    <chr>       <chr>
# 1 London   Paris       Day
# 2 Paris    London      Night
# 3 Italy    Norway      Day
# 4 Spain    Portugal    Day
# 5 Portugal Spain       Night
# 6 Poland   Spain       Day
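
The code above assigns Day and Night by position within each pair. If the goal is instead to keep the observed values and only relabel pairs that occur with both times, the same pmin/pmax grouping can be reused. The following is a minimal sketch under that assumption; it uses the question's example data, and the names `od` and `relabelled` are made up for illustration.

```r
library(dplyr)

# Example data from the question, with Time kept as character (assumption)
od <- data.frame(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"),
  stringsAsFactors = FALSE
)

relabelled <- od %>%
  # group by the city pair regardless of direction
  group_by(Origin2 = pmin(Origin, Destination),
           Destination2 = pmax(Origin, Destination)) %>%
  # relabel the whole pair if both times (or the combined label) occur,
  # otherwise keep the observed values unchanged
  mutate(Time = if (all(c("Day", "Night") %in% Time) || "Day/Night" %in% Time) {
    "Day/Night"
  } else {
    Time
  }) %>%
  ungroup() %>%
  select(Origin, Destination, Time)

relabelled
```

With this data, every London/Paris row and the Madrid/Lisbon row end up as Day/Night, while the two Lisbon/Berlin rows stay Day.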

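Thanks to @Nareman Darwisch for the dplyr solution, which inspired me to work on a data.table solution.

I am creating a new variable that serves as a unique ID for each origin/destination pair.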

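dat.sort <- t(apply(dt[, .(Origin, Destination)], 1, sort))
dt.temp <- data.table(dat.sort)
dt.temp[, unique.name := paste(V1, V2)]
dt$unique.name <- factor(dt.temp$unique.name)

What I would like to understand is how to use a logical condition in the spirit of looking at the levels by group and comparing them with the cases I want.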

dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]

But I guess the levels command always gives me all three levels.
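
One way to make the attempt above behave as intended is to test the values that actually occur in each group rather than the levels. This is only a sketch on the question's example data; the column name `both.times` and the pmin/pmax construction of `unique.name` are assumptions made for illustration.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
)

# Order-independent ID for each origin/destination pair
dt[, unique.name := paste(pmin(Origin, Destination), pmax(Origin, Destination))]

# levels(Time) reports all three levels in every group, whereas
# uniqueN(Time) counts the values that really occur in the group
counts <- dt[, .(n.levels = length(levels(Time)), n.used = uniqueN(Time)),
             by = unique.name]

# Compare against the observed values instead of the levels
dt[, both.times := all(c("Day", "Night") %in% as.character(Time)), by = unique.name]
```

Here `both.times` is TRUE only for the London/Paris rows, because that is the only pair with a separate Day row and Night row.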

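If I understand correctly, the OP wants to

1. identify city pairs regardless of the order of origin and destination, e.g. London-Paris is the same city pair as Paris-London,
2. collapse the separate rows if a city pair is operated Day and Night, or Day/Night,
3. or update the original dataset.

This is what I would do: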

library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]

The key point is to identify the city pairs which fulfil the second requirement:

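dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]

So there is no need to deal with factor levels. By the way, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What changes is which levels are actually used within the subset or group.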
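
The behaviour of factor levels described here can be checked in base R. A small sketch, using the Time vector from the question (the name `sub` is chosen only for this illustration):

```r
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))

# Subset that contains only "Day" values
sub <- Time[c(1, 3, 5)]

levels(sub)                 # still reports all three levels
levels(droplevels(sub))     # only the level that is actually used
unique(as.character(sub))   # the values present, independent of the levels
```

This is why a test such as `"Night" %in% Time` inside a group is reliable, while a test on `levels(Time)` is not.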

pairs_DN contains the unique keys of these city pairs.
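
The update join covers the "update the original dataset" option. For the other option, collapsing to one row per city pair, a possible sketch is shown below. It assumes the direction of travel can be discarded, and `collapsed` is a name made up for this illustration.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
)

# City pair as an order-independent grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")]

# One row per pair: "Day/Night" if both times (or the combined label) occur,
# otherwise the single time value observed for that pair
collapsed <- dt[, .(Time = if (all(c("Day", "Night") %in% Time) || "Day/Night" %in% Time)
                             "Day/Night" else as.character(Time[1])),
                by = Pair]

collapsed
```

Returning a character value from both branches keeps the column type consistent across groups.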


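So, are you trying to find out whether there are cases where the same two countries have two records during the day?

I am trying to find records for the same origin/destination that is operated both during the day and at night, and I want to recode them as Day/Night.

Intermediate results from the data.table answer above:

dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
            Pair    V1
1:  London-Paris  TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid  TRUE

pairs_DN
            Pair
1:  London-Paris
2: Lisbon-Madrid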