r - Check how many times each value of a vector falls within a set of areas


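I have two data frames: the first holds the coordinates of some points, and the second holds a set of areas, each with limits on both lat and lon. For each point, I want to know which areas it falls in and the total capacity of those areas.

For example, df1 has the points and df2 has the areas and their capacities: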

df1 <- data.frame(cluster = c("id1", "id2", "id3"),
              lat_m = c(-3713015, -4086295, -3710672),
              lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1","a2","a3"),
              max_lat = c(-3713013,-3713000, -3710600),
              min_lat = c(-3713017,-3713100, -3710700),
              max_lon = c(-6556755,-6556740, -6569820),
              min_lon = c(-6556765,-6556800, -6569840),
              capacity = c(5,2,3))
I would like to get something like this:

result <- data.frame(cluster = c("id1", "id2", "id3"),
                 areas = c(2, 0, 1),
                 areas_id = c("a1, a2", "", "a3"),
                 capacity = c(7, 0, 3))

My data has more than 1 million points and more than 10,000 areas, so ideally I should avoid for loops.
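
To pin down what the expected result means, here is a minimal base R sketch that computes it by brute force. It assumes the area bounds are inclusive, and it builds a points-by-areas logical matrix with outer(), so it is only suitable for small samples (or for checking a faster solution), not for 1 million points against 10,000 areas in one go. The names hits and reference are illustrative and do not come from the question.

```r
df1 <- data.frame(cluster = c("id1", "id2", "id3"),
                  lat_m = c(-3713015, -4086295, -3710672),
                  lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1", "a2", "a3"),
                  max_lat = c(-3713013, -3713000, -3710600),
                  min_lat = c(-3713017, -3713100, -3710700),
                  max_lon = c(-6556755, -6556740, -6569820),
                  min_lon = c(-6556765, -6556800, -6569840),
                  capacity = c(5, 2, 3))

# hits[i, j] is TRUE when point i lies inside area j (bounds assumed inclusive).
hits <- outer(df1$lat_m, df2$min_lat, ">=") &
  outer(df1$lat_m, df2$max_lat, "<=") &
  outer(df1$lon_m, df2$min_lon, ">=") &
  outer(df1$lon_m, df2$max_lon, "<=")

reference <- data.frame(
  cluster = df1$cluster,
  # number of areas containing each point
  areas = rowSums(hits),
  # ids of those areas, or "" when there are none
  areas_id = apply(hits, 1, function(h) paste(df2$id[h], collapse = ", ")),
  # summed capacity of those areas
  capacity = as.vector((hits * 1) %*% df2$capacity)
)
reference
```

The matrix has one cell per point and area pair, which is why a join-based approach is the better fit for the full data.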

You can join the two tables on >= and <= conditions.


Edited to be clearer. Judging from the benchmarks I have found online, data.table will most likely crush sqldf on speed, although I am not sure about this specific case, as I am not very proficient with either package.
library(data.table)
library(magrittr) # not necessary, just loaded for %>%
setDT(df1)
setDT(df2)

df2[df1, on = .(min_lat <= lat_m, max_lat >= lat_m, min_lon <= lon_m, max_lon >= lon_m)
    , .(cluster, id, capacity)] %>% # these first two lines do the join
  .[, .(areas = sum(!is.na(capacity))
       , areas_id = paste(id, collapse = ', ')
       , capacity = sum(capacity, na.rm = TRUE))
    , by = cluster] # this summarises each cluster group of rows


#    cluster areas areas_id capacity
# 1:     id1     2   a1, a2        7
# 2:     id2     0       NA        0
# 3:     id3     1       a3        3
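
In the output above, areas_id for id2 shows NA, whereas the desired result has an empty string there. That comes from how paste() treats a lone NA: it turns it into the text "NA". A small helper that drops missing ids before collapsing is one way around this. The name collapse_ids is hypothetical and not part of the original answer.

```r
# Hypothetical helper: collapse the non-missing ids of one group
# into a single comma-separated string ("" when nothing matched).
collapse_ids <- function(id) {
  id <- as.character(id)
  paste(id[!is.na(id)], collapse = ", ")
}

paste(NA, collapse = ", ")    # the string "NA", not a missing value
collapse_ids(c("a1", "a2"))   # "a1, a2"
collapse_ids(NA)              # ""
```

A helper like this could stand in for paste(id, collapse = ', ') or toString(id) in the summarising step, assuming an empty string is what you want for points that match no area.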

With sqldf only:

library(sqldf)

sqldf("
select    df1.cluster
          , count(df2.id) as areas
          , group_concat(df2.id) as areas_id
          , coalesce(sum(df2.capacity), 0) as capacity
from      df1 
          left join df2 
          on  df1.lat_m between df2.min_lat and df2.max_lat 
              and df1.lon_m between df2.min_lon and df2.max_lon
group by  df1.cluster
")

#   cluster areas areas_id capacity
# 1     id1     2    a1,a2        7
# 2     id2     0     <NA>        0
# 3     id3     1       a3        3

Here is a solution using sqldf and dplyr:

library(sqldf)
library(dplyr)

sql <- paste0(
         "SELECT df1.cluster, df2.id, df2.capacity ",
         "FROM df1 LEFT JOIN df2 ON (df1.lat_m BETWEEN df2.min_lat AND df2.max_lat) AND ",
         "(df1.lon_m BETWEEN df2.min_lon AND df2.max_lon)"
       )

result <- sqldf(sql) %>%
  group_by(cluster) %>%
  summarise(
    areas = n_distinct(id) - anyNA(id),
    areas_id = toString(id),
    capacity = sum(capacity, na.rm = TRUE)
  )

# A tibble: 3 x 4
#   cluster areas areas_id capacity
#   <fct>   <int> <chr>       <dbl>
# 1 id1         2 a1, a2       7.00
# 2 id2         0 NA           0
# 3 id3         1 a3           3.00