Zip code distance in R


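I am using the zipcode package in R and want to list, for each zip code, all zip codes within a 10-mile, 20-mile, or X-mile radius. From there, I will roll up zip-level data to the 10-, 20-, or X-mile level. Right now I join every zip code to every zip code (so the row count is squared), then calculate the distance between each pair, then drop the pairs that are farther apart than 10, 20, or X miles. Is there a better way to do this in R so that I don't have to compute every possible pair? I am new to R. Thanks.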

Code is here:
#Bringing in Zipcode database. 
library(zipcode)
data(zipcode)

#Limiting to certain states that I want to include,
SEZips <- zipcode[zipcode$state %in% c("GA","AL", "SC", "NC"),]

#Duplicating the data set to join it together
SEZips2 <- SEZips

#To code in SQL
library(sqldf)

#Creating a common match so I can join all rows from both tables together
SEZips$Match <- 1
SEZips2$Match <- 1

#attaches every zip code to each zip
ZipList <- sqldf("
                 SELECT
                 A.zip as zip1,
                 A.longitude as lon1,
                 A.latitude as lat1,
                 B.zip as zip2,
                 B.longitude as lon2,
                 B.latitude as lat2
                 From SEZips A
                 Left Join SEZips2 B
                 on A.Match = B.Match
                 ")


#to get the distance calculation, use package geosphere, 
library(geosphere)

#radius of Earth in miles, adjust for km, etc.
r = 3959
#Creating Table of the coordinates. Makes it easy to calc distance
Points1 <- cbind(ZipList$lon1,ZipList$lat1)
Points2 <- cbind(ZipList$lon2,ZipList$lat2)
distance <- distHaversine(Points1,Points2,r)

#Adding distance back on to the original ZipList
ZipList$Distance <- distance

#To limit to a certain radius, e.g. 15 for 15 miles.
z = 15
#Eliminating matches > z 
ZipList2 <- ZipList[ZipList$Distance <= z,]

#Adding data to roll up, e.g. population
ZipPayroll <- read.csv("filepath/ZipPayroll.csv")

#Changing zip from integer to a 5-character string
#(pad on the left with zeros to a width of 5)
ZipPayroll$Zip2 <- sprintf("%05d", as.integer(ZipPayroll$zip))

#Joining Payroll info to SEZips dataframe
SEZips <- sqldf("
                SELECT
                A.*,
                B.Payroll, 
                B.Employees,
                B.Establishments
                From SEZips A
                Left Join ZipPayroll B
                on A.zip = B.Zip2
                ")

#Rolling up to 15 mile level
SEZips15 <- sqldf("
                  SELECT
                  A.zip1 as Zip, 
                  Sum(B.Payroll) as PayrollArea,
                  Sum(B.Employees) as EmployeesArea,
                  Sum(B.Establishments) as EstablishmentsArea
                  From ZipList2 A
                  Left Join SEZips B
                  on A.zip2 = B.zip
                  Group By A.zip1
                  ")

#Include the original Zip data
SEZips15 <- sqldf("
                  SELECT
                  A.*,
                  B.Payroll,
                  B.Employees,
                  B.Establishments
                  From SEZips15 A
                  Left Join SEZips B
                  on A.zip = B.zip
                  ")

#Calculate Average Pay for Zip and Area
SEZips15$AvgPayArea <- SEZips15$PayrollArea / SEZips15$EmployeesArea
SEZips15$AvgPay <- SEZips15$Payroll / SEZips15$Employees
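
One possible way to avoid measuring every pair is to pre-filter candidates with a latitude/longitude bounding box and only run the Haversine calculation on what survives. The sketch below is an illustration in base R, not the approach from the question or from any package: `haversine_miles()` and `neighbours_within()` are hypothetical helper names, the four points are made-up coordinates rather than real zip codes, and the box ignores the poles and the antimeridian (which should not matter for the states in the question).

```r
# Haversine great-circle distance in miles (base R, vectorised).
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 3959) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up illustrative points, NOT real zip codes.
zips <- data.frame(
  zip       = c("A", "B", "C", "D"),
  longitude = c(0, 0.1, 2, 0.05),
  latitude  = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# For each row, keep only points inside a slightly generous bounding box,
# then apply the exact distance test to those candidates.
neighbours_within <- function(zips, max_miles, r = 3959) {
  miles_per_degree <- r * pi / 180            # about 69.1 miles per degree of latitude
  dlat <- max_miles / miles_per_degree        # box half-height in degrees
  out <- lapply(seq_len(nrow(zips)), function(k) {
    lon0 <- zips$longitude[k]
    lat0 <- zips$latitude[k]
    # A degree of longitude shrinks with cos(latitude); widen the box a little.
    dlon <- dlat / cos((abs(lat0) + dlat) * pi / 180)
    in_box <- abs(zips$latitude - lat0) <= dlat &
      abs(zips$longitude - lon0) <= dlon
    cand <- zips[in_box, ]
    d <- haversine_miles(lon0, lat0, cand$longitude, cand$latitude, r)
    keep <- d <= max_miles
    data.frame(
      zip1     = rep(zips$zip[k], sum(keep)),
      zip2     = cand$zip[keep],
      Distance = d[keep],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, out)
}

ZipList2_sketch <- neighbours_within(zips, max_miles = 15)
```

The result has the same shape as `ZipList2` (zip1, zip2, Distance), including each point paired with itself, so the roll-up queries could be pointed at it unchanged. Whether this beats the full join in practice depends on the data size, so it is worth timing both.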

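I added a solution below using the spatialrisk package. The key functions in this package are written in C++, so they are very fast. The function spatialrisk::points_in_circle() finds the observations that lie within a given radius of a center point. Note that distances are calculated using the Haversine formula.

library(spatialrisk)
library(tidyverse)

# radius is in meters per the spatialrisk documentation:
# 10000 m is about 6.2 miles (10 miles is about 16093 m)
zips_within_radius <- function(x, y, z) {
  points_in_circle(SEZips, x, y, lon = longitude, lat = latitude, radius = 10000) %>%
    mutate(source_zip = z)
}

# Apply the function to every zip code and row-bind the results
pmap_dfr(list(SEZips$longitude, SEZips$latitude, SEZips$zip), zips_within_radius)

Your current code actually compares every zip code with every other zip code twice (A = 16127 with B = 27513, and A = 27513 with B = 16127). You could halve the work by comparing each pair only once. If you picture it as a two-dimensional grid or table, you only need the matches above the diagonal that runs from top left to bottom right. Also, you can simply join on 1=1 and avoid creating the Match variable.

Also, as a follow-up question: are you running into performance problems? If not, your solution is probably fine. If so, you could benefit from optimizing.

Thanks John! As for comparing everything twice, I think I need both directions. For example, I need the pair 16127, 27513 and the pair 27513, 16127. That lets me summarize all the zips around 16127 and around 27513 separately. There will be overlap, which is expected. I don't have a performance problem at the moment, but this is only four states of data, and the number of pairs grows with the square of the number of zip codes as I add more. I am new to R, so I do a lot of the work in sqldf, and I am curious whether there is a more efficient way to do this in R or with geospatial calculations. By the way, thanks for the tip on Match!

The distance from zip 16127 to 27513 is the same as the distance from 27513 to 16127. If you want to save CPU cycles, you can compute that distance once and then transform the data where necessary. (Transforming the data may turn out to be more expensive than simply doing the calculation, though, so try it and see.)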
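
The "compute each distance once, then transform" idea from the comments could look roughly like the following base R sketch. It measures only the pairs above the diagonal, mirrors the survivors so both directions are present, and adds each point paired with itself at distance 0 (the full join in the question includes those rows too). `haversine_miles()` and `pairs_within()` are hypothetical helper names, and the three points are made-up coordinates, not real zip codes.

```r
# Haversine great-circle distance in miles (base R, vectorised).
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 3959) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up illustrative points, NOT real zip codes.
zips <- data.frame(
  zip       = c("A", "B", "C"),
  longitude = c(0, 0.1, 2),
  latitude  = c(0, 0, 0),
  stringsAsFactors = FALSE
)

pairs_within <- function(zips, max_miles) {
  n <- nrow(zips)
  # Index pairs strictly above the diagonal: each unordered pair appears once.
  idx <- which(upper.tri(matrix(TRUE, n, n)), arr.ind = TRUE)
  i <- idx[, "row"]
  j <- idx[, "col"]
  d <- haversine_miles(zips$longitude[i], zips$latitude[i],
                       zips$longitude[j], zips$latitude[j])
  keep <- d <= max_miles
  one_way <- data.frame(zip1 = zips$zip[i[keep]], zip2 = zips$zip[j[keep]],
                        Distance = d[keep], stringsAsFactors = FALSE)
  # Mirror the surviving pairs so both (zip1, zip2) and (zip2, zip1) exist.
  mirrored <- data.frame(zip1 = one_way$zip2, zip2 = one_way$zip1,
                         Distance = one_way$Distance, stringsAsFactors = FALSE)
  # Each point is within any non-negative radius of itself.
  self <- data.frame(zip1 = zips$zip, zip2 = zips$zip,
                     Distance = 0, stringsAsFactors = FALSE)
  rbind(one_way, mirrored, self)
}

ZipList2_sketch <- pairs_within(zips, max_miles = 15)
```

This halves the number of distance calculations but still enumerates every pair, so memory use stays quadratic in the number of zip codes. It is a sketch to time against the original, not a guaranteed improvement.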