R: Re-look up values in a data.table based on a lookup table


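I am trying to implement a re-lookup on a very large data set (about 25 million rows, about 3000 columns), based on a lookup table that is also large (about 15 million rows). I only need to change the values that match; values without a match must stay unchanged.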

Here is a data sample.

Lookup table

source  target
A       1
B       2
C       3
D       4
...     ...

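Source data (I do not know the number of columns before loading the table)

col1    col2    col3    ...     coln
B       C       A       ...     ...
78      A       D       ...     ...
A       B       24      ...     ...
...     ...     ...     ...     ...

Expected result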

col1    col2    col3    ...     coln
2       3       1       ...     ...
78      1       4       ...     ...
1       2       24      ...     ...
...     ...     ...     ...     ...

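I have been able to do this using nested loops, but:

It is slow

I know R can do better than that

I found some posts with similar problems, but none of the solutions seems to work in my case.

Any suggestions?

Thanks

(I tried the different solutions as explained, without success.)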

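A solution using dplyr and tidyr. The idea is to reshape the data frame from wide format to long format, join the values in the data frame against the lookup table, and then reshape the result back to the original wide format.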

library(dplyr)
library(tidyr)

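dt2 <- dt %>%
  mutate(ID = 1:n()) %>%                                # row id, needed to reshape back
  gather(Column, Value, -ID) %>%                        # wide -> long
  left_join(dt_lookup, by = c("Value" = "source")) %>%  # attach target where Value matches
  # keep the original Value wherever the lookup found no match
  mutate(target = as.numeric(ifelse(is.na(target), Value, target))) %>%
  select(-Value) %>%
  spread(Column, target) %>%                            # long -> wide
  select(-ID)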
dt2
#   col1 col2 col3
# 1    2    3    1
# 2   78    1    4
# 3    1    2   24

Data

dt_lookup <- read.table(text = "source  target
A       1
                        B       2
                        C       3
                        D       4",
                        header = TRUE, stringsAsFactors = FALSE)

dt <- read.table(text = "col1    col2    col3
B       C       A   
                 78      A       D 
                 A       B       24",
                 header = TRUE, stringsAsFactors = FALSE)
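
The same reshape, join, reshape-back idea can also be sketched with pivot_longer() and pivot_wider(), which tidyr now recommends in place of gather() and spread(). This is a minimal sketch, not the answerer's code: the sample data is rebuilt with data.frame() so the block runs on its own, and it keeps the result as character via coalesce() instead of converting to numeric, on the assumption that unmatched values may not always be numeric.

```r
library(dplyr)
library(tidyr)

# Sample data, rebuilt here so the block is self-contained
dt_lookup <- data.frame(source = c("A", "B", "C", "D"),
                        target = 1:4,
                        stringsAsFactors = FALSE)
dt <- data.frame(col1 = c("B", "78", "A"),
                 col2 = c("C", "A", "B"),
                 col3 = c("A", "D", "24"),
                 stringsAsFactors = FALSE)

dt2 <- dt %>%
  mutate(ID = row_number()) %>%                                   # row id for reshaping back
  pivot_longer(-ID, names_to = "Column", values_to = "Value") %>% # wide -> long
  left_join(dt_lookup, by = c("Value" = "source")) %>%            # target is NA when unmatched
  mutate(target = coalesce(as.character(target), Value)) %>%      # fall back to original value
  select(-Value) %>%
  pivot_wider(names_from = Column, values_from = target) %>%      # long -> wide
  select(-ID)
dt2
```

Like the gather()/spread() version, this builds a long copy of the data, so it is probably better suited to small or medium tables than to the sizes described in the question.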

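According to the OP, both data objects are very large (25 M rows x 3000 columns, and a lookup table of 15 M rows). Therefore, I suggest avoiding copies. This can be achieved with data.table's update on join, which modifies only the selected values in place, i.e., without copying the whole data object.
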
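library(data.table)
options(datatable.print.class = TRUE)  # show column classes when printing
address(data_set)                      # memory address before the update
# loop over all columns
for (col in names(data_set))
  # update on join: overwrite only the rows of `col` that match lookup$source
  data_set[lookup, on = paste0(col, "==source"), (col) := target]
address(data_set)                      # compare with the address above: no copy was made
data_set[]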

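Note that the parameter colClasses = "character" passed to fread() ensures that "target" is of type character, so it has the same type as the columns it is assigned into.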
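
To make the in-place claim easy to check, here is a self-contained run of the same loop on the small sample. It is only a sketch: the tables are built with data.table() rather than fread(), the names addr_before and addr_after are introduced for illustration, and i.target is used instead of target to make explicit that the value comes from the lookup table.

```r
library(data.table)

# Sample data, rebuilt here so the block is self-contained
lookup <- data.table(source = c("A", "B", "C", "D"),
                     target = c("1", "2", "3", "4"))
data_set <- data.table(col1 = c("B", "78", "A"),
                       col2 = c("C", "A", "B"),
                       col3 = c("A", "D", "24"))

addr_before <- address(data_set)

# Update on join, one column at a time; rows without a match are left untouched
for (col in names(data_set)) {
  data_set[lookup, on = paste0(col, "==source"), (col) := i.target]
}

addr_after <- address(data_set)

identical(addr_before, addr_after)  # checks that the object was not copied
data_set[]
```

Values such as 78 and 24 have no entry in the lookup table and therefore stay as they are.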

Data

library(data.table)
options(datatable.print.class = TRUE)
lookup <- fread("source  target
A       1
B       2
C       3
D       4", colClasses = "character")

data_set <- fread("col1    col2    col3
B       C       A
78      A       D
A       B       24")
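
With around 3000 columns, the per-call overhead of `[.data.table` inside the loop may become noticeable. A possible variation, not part of the answer above, is to combine base R's match() with data.table's set(), which the package documents as a low-overhead, loopable form of `:=`. The sketch below assumes character columns on both sides and rebuilds the sample data; the names idx and hit are illustrative, and whether this is faster at the OP's scale has not been measured here.

```r
library(data.table)

# Sample data, rebuilt here so the block is self-contained
lookup <- data.table(source = c("A", "B", "C", "D"),
                     target = c("1", "2", "3", "4"))
data_set <- data.table(col1 = c("B", "78", "A"),
                       col2 = c("C", "A", "B"),
                       col3 = c("A", "D", "24"))

for (col in names(data_set)) {
  # position of each value in lookup$source, NA where there is no match
  idx <- match(data_set[[col]], lookup$source)
  hit <- which(!is.na(idx))
  # overwrite only the matching rows, by reference
  if (length(hit) > 0L) {
    set(data_set, i = hit, j = col, value = lookup$target[idx[hit]])
  }
}
data_set[]
```

As with the update on join, only the matched cells are written, so unmatched values are preserved and the table is not copied.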