大数据集上的R矢量化findInterval_R_For Loop_Vectorization

大数据集上的R矢量化findInterval

r for-loop

大数据集上的R矢量化findInterval,r,for-loop,vectorization,R,For Loop,Vectorization,我有两个数据帧，我正在使用findInterval。井眼数据是用于产油的井眼x、y和z的数据（VSS=垂直海底深度，md=测量深度，也称钻头沿井向下移动的实际距离）。性能数据是指井筒已穿孔以允许流动的数据（顶部性能=md，底部性能=md）性能： Well_ID top_perf bot_perf well_name surface ID x y VSS 056-W 2808 2958 056-W Ranger

我有两个数据帧，我正在使用findInterval。井眼数据是用于产油的井眼x、y和z的数据（VSS=垂直海底深度，md=测量深度，也称钻头沿井向下移动的实际距离）。性能数据是指井筒已穿孔以允许流动的数据（顶部性能=md，底部性能=md）

性能：

Well_ID   top_perf    bot_perf    well_name   surface   ID  x   y   VSS
056-W        2808        2958        056-W     Ranger   2   0   0   0
056-W        3150        3250        056-W       Ranger 1   0   0   0
056-W        3150        3250        056-W       Ranger 2   0   0   0
056-W        3559        3664        056-W       UT 1   1   0   0   0
056-W        3559        3664        056-W       UT 2   2   0   0   0
057-W        2471        2952        057-W       Tar    1   0   0   0
057-W        2471        2952        057-W       Tar    2   0   0   0
058-W        2615        2896        058-W       Ranger 1   0   0   0
058-W        2615        2896        058-W       Ranger 2   0   0   0

井筒：

well_name   well_id      md      vss         x       y          
056-W        056-W       3260   -3251.46    4221436 4030454
056-W        056-W       3280   -3271.45    4221436 4030454
056-W        056-W       3300   -3291.45    4221435 4030453
056-W        056-W       3320   -3311.44    4221435 4030453
056-W        056-W       3340   -3331.44    4221434 4030453
056-W        056-W       3360   -3351.43    4221434 4030453
056-W        056-W       3380   -3371.43    4221433 4030453
056-W        056-W       3400   -3391.42    4221433 4030453

目标是找到与井筒$md最接近的性能$top_perf和性能$bot_perf，其中性能$Well_ID=井筒$Well_ID，然后从井筒中提取vss、x和y，并将其添加到性能中。（如果介于两者之间，我不在乎插值，只需要接近的东西）

以下是我执行此操作的代码：

for(i in 1:dim(Perfs)[1]){
  if(Perfs$ID[i] == 1){
    Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
    interval <- findInterval(Perfs[i,"top_perf"], Wellbore_temp$md)
    Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
  }else{
    Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
    interval <- findInterval(Perfs[i,"bot_perf"], Wellbore_temp$md)
    Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
  }
}

for（i in 1:dim（Perfs）[1]）{
if（性能$ID[i]==1）{
下面给出了一个Data.Table解决方案。我只对你所显示的小数据集进行了测试，在那个小的数据集上它比你的解决方案慢，但是我认为它可能会更好。如果不是，考虑并行化。< /P>
如果您以前没有使用过data.table，我认为它通常非常快，但语法可能有点复杂。.SD
指的是连接到性能数据第一行的井筒数据子集（遍历.EACHI
）。这将节省大量的所有内容到所有内容的连接。我不使用findInterval函数，而是计算错误（top_perf-md
或bot_perf-md
）并将绝对错误降至最低。与滚动连接（“最近”）相比，这种方法的优点是，您可以看到错误是什么，并在必要时进行过滤
library(data.table)

Perfs <- fread(input = 'Well_ID   top_perf    bot_perf    well_name   surface   ID  x   y   VSS
056-W        2808        2958        056-W     Ranger   2   0   0   0
056-W        3150        3250        056-W       Ranger 1   0   0   0
056-W        3150        3250        056-W       Ranger 2   0   0   0
056-W        3559        3664        056-W       UT_1   1   0   0   0
056-W        3559        3664        056-W       UT_2   2   0   0   0
057-W        2471        2952        057-W       Tar    1   0   0   0
057-W        2471        2952        057-W       Tar    2   0   0   0
058-W        2615        2896        058-W       Ranger 1   0   0   0
058-W        2615        2896        058-W       Ranger 2   0   0   0')

Wellbore <- fread(input = 'well_name   well_id      md      vss         x       y          
056-W        056-W       3260   -3251.46    4221436 4030454
056-W        056-W       3280   -3271.45    4221436 4030454
056-W        056-W       3300   -3291.45    4221435 4030453
056-W        056-W       3320   -3311.44    4221435 4030453
056-W        056-W       3340   -3331.44    4221434 4030453
056-W        056-W       3360   -3351.43    4221434 4030453
056-W        056-W       3380   -3371.43    4221433 4030453
056-W        056-W       3400   -3391.42    4221433 4030453')


#top
setkey(Wellbore, 'well_id')
setkey(Perfs, 'Well_ID', 'top_perf')
top_matched <- Wellbore[unique(Perfs), .SD[which.min(abs(top_perf-md)),.(md, top_perf, err=top_perf-md, x,y,vss)],nomatch=0, by=.EACHI]
setkey(top_matched, 'well_id', 'top_perf')
top_joined <- top_matched[Perfs]
top_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)]
setnames(top_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('top_', c('err', 'x', 'y', 'vss')))

#bottom
setkey(Perfs, 'Well_ID', 'bot_perf')
bot_matched <- Wellbore[unique(Perfs), .SD[which.min(abs(bot_perf-md)),.(md, bot_perf, err=bot_perf-md, x,y,vss)],nomatch=0, by=.EACHI]
setkey(bot_matched, 'well_id', 'bot_perf')
bot_joined <- bot_matched[Perfs]
bot_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)]
setnames(bot_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('bot_', c('err', 'x', 'y', 'vss')))


answer <- cbind(top_joined[,c(1:2,9:11,3:7), with=F], bot_joined[,3:7,with=F])

# well_id   md well_name surface ID top_perf top_err   top_x   top_y  top_vss bot_perf bot_err
# 1:   056-W 3260     056-W  Ranger  2     2808    -452 4221436 4030454 -3251.46     2958    -302
# 2:   056-W 3260     056-W  Ranger  1     3150    -110 4221436 4030454 -3251.46     3250     -10
# 3:   056-W 3260     056-W  Ranger  2     3150    -110 4221436 4030454 -3251.46     3250     -10
# 4:   056-W 3400     056-W    UT_1  1     3559     159 4221433 4030453 -3391.42     3664     264
# 5:   056-W 3400     056-W    UT_2  2     3559     159 4221433 4030453 -3391.42     3664     264
# 6:   057-W   NA     057-W     Tar  1     2471      NA      NA      NA       NA     2952      NA
# 7:   057-W   NA     057-W     Tar  2     2471      NA      NA      NA       NA     2952      NA
# 8:   058-W   NA     058-W  Ranger  1     2615      NA      NA      NA       NA     2896      NA
# 9:   058-W   NA     058-W  Ranger  2     2615      NA      NA      NA       NA     2896      NA
# bot_x   bot_y  bot_vss
# 1: 4221436 4030454 -3251.46
# 2: 4221436 4030454 -3251.46
# 3: 4221436 4030454 -3251.46
# 4: 4221433 4030453 -3391.42
# 5: 4221433 4030453 -3391.42
# 6:      NA      NA       NA
# 7:      NA      NA       NA
# 8:      NA      NA       NA
# 9:      NA      NA       NA

库（data.table）
Perfs在这里找到了问题的答案：
基于@ds440提供的data.table的思想
以下是我使用的代码，运行速度非常快：
Perf.Data <- Perfs


Wellbore.Perfs <- data.table(Wellbore[,c("well_id","md","vss")])
Spotfire.Top.Perf <- data.table(Perf.Data[,c("Well_ID","top_perf", "bot_perf")])
Spotfire.Bot.Perf <- data.table(Perf.Data[,c("Well_ID","bot_perf", "top_perf")])

#Change the column names to match up with Wellbore.Perfs
#Add in the bot_perf to .top.perf and the top_perf to the .bot.perf is done to make these unique and ensure everything is captured from the perfs table
colnames(Spotfire.Top.Perf) <- c("well_id","md", "bot_perf")
colnames(Spotfire.Bot.Perf) <- c("well_id","md","top_perf")

#set key to join on
setkey(Wellbore.Perfs, "well_id","md")

#roll = "nearest" will take the nearest value of md in .top.perf or .bot.perf and match it to the md in wellbore.perfs where Well_ID = Well_ID
Perfs.Wellbore.Top <- Wellbore.Perfs[Spotfire.Top.Perf, roll = "nearest"]
Perfs.Wellbore.Bot <- Wellbore.Perfs[Spotfire.Bot.Perf, roll = "nearest"]

Perf.Data谢谢，它的伸缩性肯定更好，但是使用中点和误差函数会产生一些问题，比如如果top_Perf-bot_Perf=300，那么中点与这些Perf沿井筒的实际位置不太接近/top_Perf和bot_Perf之间没有区别。这是为了绘图，我负担不起显示距离实际值150英尺的点。您可以分别使用top\u perf
和bot\u perf
创建top\u matched
和bot\u matched
（而不是midpt
）。如果你计算错误，你可以过滤掉不相关的匹配。我现在编辑了，去掉了中点，现在可以直接比较md和顶部或底部。如果只是在显示的样本数据上运行，一些“最近的”匹配是非常糟糕的，但是你看不到，除非你计算距离。我应该在t中提到这一点这是一个原始的帖子，但是这个井眼表有MD，范围从0到井底，每口井增加20个。这一点很好。