Column name labels in R data.table joins

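I am trying to join data.table x to z using a non-equi join. Table x contains two columns, X1 and X2, which define the range that column Z1 of z is matched against. My current code performs the non-equi join successfully, but it drops or renames some columns. I would like to get the "ideal" data.table shown below directly, rather than the result I currently get, which I have to fix up by renaming columns or joining the data again.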

> library(data.table)
> 
> x <- data.table(Id  = c("A", "B", "C", "C"),
+                 X1  = c(1L, 3L, 5L, 7L),
+                 X2 = c(8L,12L,9L,18L),
+                 XY  = c("x2", "x4", "x6", "x8"))
> 
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
> 
> x
   Id X1 X2 XY
1:  A  1  8 x2
2:  B  3 12 x4
3:  C  5  9 x6
4:  C  7 18 x8
> z
   ID Z1 Z2
1:  C  5 z5
2:  C  6 z6
3:  C  7 z7
4:  C  8 z8
5:  C  9 z9
> 
> # suboptimal data return data format
> x[z, on = .(Id == ID, X1 <= Z1, X2 >= Z1)]
   Id X1 X2 XY Z2
1:  C  5  5 x6 z5
2:  C  6  6 x6 z6
3:  C  7  7 x6 z7
4:  C  7  7 x8 z7
5:  C  8  8 x6 z8
6:  C  8  8 x8 z8
7:  C  9  9 x6 z9
8:  C  9  9 x8 z9
> 
> # column names are Id, X1 and X2 from x which replaces ID and Z1. The contents of X1 and X2 are also changed to the original values of Z1.
> # XY and Z2 remain unchanged.
> 
> # I want to create the following table where the original column names and values are retained, while still joining the table in a non-equi way.
> 
> ideal <- data.table(ID = c("C", "C", "C", "C", "C", "C", "C", "C"),
+                     Z1 = c(5, 6, 7, 7, 8, 8, 9, 9),
+                     Z2 = c("z5", "z6", "z7", "z7", "z8", "z8", "z9", "z9"),
+                     X1 = c(5, 5, 5, 7, 5, 7, 5, 7),
+                     X2 = c(9, 9, 9, 18, 9, 18, 9, 18),
+                     XY = c("x6", "x6", "x6", "x8", "x6", "x8", "x6", "x8"))
> 
> print(ideal)
   ID Z1 Z2 X1 X2 XY
1:  C  5 z5  5  9 x6
2:  C  6 z6  5  9 x6
3:  C  7 z7  5  9 x6
4:  C  7 z7  7 18 x8
5:  C  8 z8  5  9 x6
6:  C  8 z8  7 18 x8
7:  C  9 z9  5  9 x6
8:  C  9 z9  7 18 x8
Make copies of the join columns and join on the copies:

copy_cols <- function(dt, nms) {
  dt[, paste0(".", nms) := lapply(.SD, copy), .SDcols = nms]
}

copy_cols(x, c("X1", "X2"))
copy_cols(z, "Z1")
x[z, on = .(Id == ID, .X1 <= .Z1, .X2 >= .Z1)][, c(".X1", ".X2") := NULL][]
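
Note that copy_cols() above adds the dot-prefixed columns to x and z by reference. A minimal sketch of the same idea that leaves the inputs untouched might look like the following. The names x_tmp, z_tmp, lo, hi and pos are hypothetical and used only for illustration; the final renaming and reordering is an assumption about the desired layout, taken from the "ideal" table in the question.

```r
library(data.table)

x <- data.table(Id = c("A", "B", "C", "C"),
                X1 = c(1L, 3L, 5L, 7L),
                X2 = c(8L, 12L, 9L, 18L),
                XY = c("x2", "x4", "x6", "x8"))
z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))

# copy() returns independent tables, so := below does not modify x or z.
# lo, hi and pos are illustrative names for the duplicated join columns.
x_tmp <- copy(x)[, `:=`(lo = X1, hi = X2)]
z_tmp <- copy(z)[, pos := Z1]

# Join on the duplicates; X1, X2 and Z1 keep their original values.
res <- x_tmp[z_tmp, on = .(Id == ID, lo <= pos, hi >= pos), nomatch = NULL]
res[, c("lo", "hi") := NULL]

# Rename and reorder to match the layout of the "ideal" table.
setnames(res, "Id", "ID")
setcolorder(res, c("ID", "Z1", "Z2", "X1", "X2", "XY"))
res
```

The trade-off is an extra copy of both tables, which may matter for large data.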

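As @Humpelstielzchen commented, this can be done by manually selecting the desired columns. However, you have to use the prefix x. (the x in x. refers to the argument x of [.data.table, not to the name of the data.table) to recover the columns of the original data.table x. Otherwise the output is incorrect.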

# desired
x[z, .(ID, Z1, Z2, X1 = x.X1, X2 = x.X2, XY), on = .(Id == ID, X1 <= Z1, X2 >= Z1)]
#    ID Z1 Z2 X1 X2 XY
# 1:  C  5 z5  5  9 x6
# 2:  C  6 z6  5  9 x6
# 3:  C  7 z7  5  9 x6
# 4:  C  7 z7  7 18 x8
# 5:  C  8 z8  5  9 x6
# 6:  C  8 z8  7 18 x8
# 7:  C  9 z9  5  9 x6
# 8:  C  9 z9  7 18 x8

# undesired
x[z, on = .(Id == ID, X1 <= Z1, X2 >= Z1), .(ID, Z1, Z2, X1, X2, XY)]
#    ID Z1 Z2 X1 X2 XY
# 1:  C  5 z5  5  5 x6
# 2:  C  6 z6  6  6 x6
# 3:  C  7 z7  7  7 x6
# 4:  C  7 z7  7  7 x8
# 5:  C  8 z8  8  8 x6
# 6:  C  8 z8  8  8 x8
# 7:  C  9 z9  9  9 x6
# 8:  C  9 z9  9  9 x8

packageVersion('data.table')
# '1.13.2'
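
If typing every column by hand is a concern, the j expression can be assembled from the column names. The following is only a sketch under stated assumptions: the names x_keep, src and j_call are hypothetical, and it assumes that the single equi-join column Id of x can be dropped because ID from z carries the same value.

```r
library(data.table)

x <- data.table(Id = c("A", "B", "C", "C"),
                X1 = c(1L, 3L, 5L, 7L),
                X2 = c(8L, 12L, 9L, 18L),
                XY = c("x2", "x4", "x6", "x8"))
z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))

# Columns of x to return (assumption: the equi-join column Id is redundant).
x_keep <- setdiff(names(x), "Id")

# Map each output name to a prefixed source column:
# i.* reads from z (the i argument), x.* reads from x.
src <- c(paste0("i.", names(z)), paste0("x.", x_keep))
names(src) <- c(names(z), x_keep)

# Build the call list(ID = i.ID, Z1 = i.Z1, ..., XY = x.XY) and evaluate it in j.
j_call <- as.call(c(list(as.name("list")), lapply(src, as.name)))
res <- x[z, eval(j_call), on = .(Id == ID, X1 <= Z1, X2 >= Z1)]
res
```

The on clause is still written by hand here; only the column selection is generated.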

In the end I answered my own question:

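library(data.table)
library(magrittr)  # provides %>% and set_names()
library(tibble)    # provides as_tibble()

data_table_tidy_join <- function(x,y, join_by){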

    x <- data.table(x)
    y <- data.table(y)

    # Determine single join names
    single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()

    # cols from x that won't require as matching in i
    remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)

    # names need to keep
    x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
    y_names_keep_raw <- names(y)

    # cols that exist in both x and y, but not being equi joined on
    cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]

    #rename so indexing works
    x_names_keep <- x_names_keep_raw
    y_names_keep <- y_names_keep_raw

    # give prefix to necessary vars
    x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
    y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")

    # implement data.table call, keeping required cols
    joined_data <-
        x[y, on = join_by,
          mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
          mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
        as_tibble()

    return(joined_data)

}

> x <- data.table(Id  = c("A", "B", "C", "C"),
+                  X1  = c(1L, 3L, 5L, 7L),
+                  X2 = c(8L,12L,9L,18L),
+                  XY  = c("x2", "x4", "x6", "x8"))
>  
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>   
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
  ID       Z1 Z2       X1    X2 XY   
  <chr> <int> <chr> <int> <int> <chr>
1 C         5 z5        5     9 x6   
2 C         6 z6        5     9 x6   
3 C         7 z7        5     9 x6   
4 C         7 z7        7    18 x8   
5 C         8 z8        5     9 x6   
6 C         8 z8        7    18 x8   
7 C         9 z9        5     9 x6   
8 C         9 z9        7    18 x8

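You can filter and order the columns in the j expression:
x[z, on = .(Id == ID, X1 <= Z1, X2 >= Z1), .(ID, Z1, Z2, X1, X2, XY)]
Not very elegant, but it works.

@Humpelstielzchen, that needs a small modification. See my answer :)

@mt1022 Nice work, this seems to be the best answer so far. I am not sure it generalizes to larger datasets with more columns, though; adding columns by hand is not very efficient. I wonder whether this is just a quirk of the data.table implementation, or whether there is a different way to do a non-equi join. More importantly, why does data.table relabel the columns used in the join in the first place?

I don't know :(. However, by reversing the direction of the join you can get the desired result without selecting columns manually, although the column order differs from the ideal data.table.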
DT <- z[x, on = .(ID=Id, Z1 >= X1, Z1 <= X2), nomatch = NULL]
#' Since for non-equi conditions the values come from the RHS while
#' the column names come from the LHS, we know that `Z1` and `Z1.1`
#' correspond to `X1` and `X2`.
setnames(DT, c('Z1', 'Z1.1'), c('X1', 'X2'))
DT[z, Z1 := i.Z1, on = .(ID, Z2)]
# > DT
#    ID X1 Z2 X2 XY Z1
# 1:  C  5 z5  9 x6  5
# 2:  C  5 z6  9 x6  6
# 3:  C  5 z7  9 x6  7
# 4:  C  5 z8  9 x6  8
# 5:  C  5 z9  9 x6  9
# 6:  C  7 z7 18 x8  7
# 7:  C  7 z8 18 x8  8
# 8:  C  7 z9 18 x8  9
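
If the column and row order of the "ideal" table matters, the result of the reversed join can be rearranged in place. This is a small sketch that repeats the steps above (written with `ID == Id` in the on clause) and then applies setcolorder() and setorder(). Sorting by Z1 and then X1 is an assumption based on the order shown in the question.

```r
library(data.table)

x <- data.table(Id = c("A", "B", "C", "C"),
                X1 = c(1L, 3L, 5L, 7L),
                X2 = c(8L, 12L, 9L, 18L),
                XY = c("x2", "x4", "x6", "x8"))
z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))

# Reversed join, as above: names come from z, values from x.
DT <- z[x, on = .(ID == Id, Z1 >= X1, Z1 <= X2), nomatch = NULL]
setnames(DT, c("Z1", "Z1.1"), c("X1", "X2"))

# Bring the original Z1 back with an update join.
DT[z, Z1 := i.Z1, on = .(ID, Z2)]

# Reorder columns and rows by reference to match the "ideal" layout.
setcolorder(DT, c("ID", "Z1", "Z2", "X1", "X2", "XY"))
setorder(DT, Z1, X1)
DT
```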