R: a faster way to process 1.2 million JSON geolocation queries from a large data frame

r dataframe geolocation


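I am working with about 6.44 million check-ins. These check-ins cover 1.28 million unique locations, but Gowalla only gives the latitude and longitude. So I need to find the city, state and country for each lat and long. From another post on StackOverflow, I was able to create the R query below, which queries OpenStreetMap and finds the relevant geographic details.

Unfortunately, it takes about 1 minute to process 125 rows, which means 1.28 million rows would take about a week (1,280,000 / 125 ≈ 10,240 minutes). Is there a faster way to find these details? Perhaps there is a package with built-in latitudes and longitudes of the world's cities that can find the city name for a given lat and long, so I don't have to query online.

venueTable is a data frame with 3 columns:

1: vid (venueId), 2: lat (latitude), 3: long (longitude)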

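for (i in 1:nrow(venueTable)) {
  # this is just an indicator showing the current value of i on screen
  cat(paste(".", i, "."))
  # the code below composes the url query
  url

Even if you can live with the run time, you will run into problems. The service you are querying allows "an absolute maximum of one request per second", which you are already violating (125 rows per minute is about two requests per second). They will most likely throttle your requests before you reach 1.2 million queries. Their website notes that similar APIs intended for larger uses only offer around 15k free requests per day.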
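To put those limits in numbers, here is a rough back-of-the-envelope calculation. It is only a sketch: the 125 rows per minute comes from the question, and the other two rates are the limits quoted above. The variable names are made up for illustration.

```r
# Rough run-time estimates, in days, for 1.28 million unique locations.
n.locations <- 1.28e6

# Rate observed in the question: about 125 rows per minute.
observed.days <- n.locations / 125 / 60 / 24

# Staying within one request per second (86,400 seconds per day).
one.per.second.days <- n.locations / 86400

# Staying within a quota of roughly 15,000 requests per day.
daily.quota.days <- n.locations / 15000

estimates <- round(c(observed       = observed.days,
                     one.per.second = one.per.second.days,
                     daily.quota    = daily.quota.days), 1)
print(estimates)
```

Under these assumptions the online route takes somewhere between one week and about three months, which is why an offline lookup is attractive.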
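You would be much better off using an offline option. A quick search shows that there are many freely available datasets of populated places, together with their longitudes and latitudes. We will use the following dataset:
Then it is easy to find the city closest to each data point and bind it to your data.frame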

# work out which index in the matrix is closest to the data
> closest.index <- apply(distance.matrix, 1, which.min)

# cbind city and country of the match onto the original query
> candidate.longlat <- cbind(candidate.longlat, cities.data[closest.index, c("city", "country")])
> print(candidate.longlat)

  vid    lat   long       city    country
1   1  12.53 -70.03 Oranjestad      Aruba
2   2 -16.31 -48.95   Anapolis     Brazil
3   3  42.87  74.59    Bishkek Kyrgyzstan
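
The snippet above relies on `distance.matrix` and `cities.data`, whose construction is not shown here. The following is a minimal sketch of how they could be built, under stated assumptions: the cities table is a tiny hypothetical stand-in with approximate coordinates and assumed column names (`city`, `country`, `lat`, `long`), and `haversine.matrix` is an illustrative base-R helper, not the function used in the original answer.

```r
# Great-circle (haversine) distance in km between every row of `a` and
# every row of `b`. Both need `lat` and `long` columns in degrees.
# Returns a matrix with nrow(a) rows and nrow(b) columns.
haversine.matrix <- function(a, b, radius.km = 6371) {
  to.rad <- pi / 180
  lat1 <- a$lat * to.rad
  lat2 <- b$lat * to.rad
  dlat <- outer(lat1, lat2, function(x, y) y - x)
  dlon <- outer(a$long * to.rad, b$long * to.rad, function(x, y) y - x)
  h <- sin(dlat / 2)^2 + outer(cos(lat1), cos(lat2)) * sin(dlon / 2)^2
  h[h > 1] <- 1  # guard against floating-point overshoot
  2 * radius.km * asin(sqrt(h))
}

# Hypothetical stand-in for the downloaded cities dataset.
cities.data <- data.frame(
  city    = c("Oranjestad", "Anapolis", "Bishkek", "Prague"),
  country = c("Aruba", "Brazil", "Kyrgyzstan", "Czechia"),
  lat     = c(12.52, -16.33, 42.87, 50.08),
  long    = c(-70.03, -48.95, 74.59, 14.43),
  stringsAsFactors = FALSE
)

# The query points, in the same layout as venueTable.
candidate.longlat <- data.frame(
  vid  = 1:3,
  lat  = c(12.53, -16.31, 42.87),
  long = c(-70.03, -48.95, 74.59)
)

# One row per query point, one column per city.
distance.matrix <- haversine.matrix(candidate.longlat, cities.data)

# For each query point, the index of the nearest city.
closest.index <- apply(distance.matrix, 1, which.min)

result <- cbind(candidate.longlat,
                cities.data[closest.index, c("city", "country")])
print(result)
```

Because each row of the matrix holds the distances from one query point to every city, `which.min` over rows picks the nearest city for that point.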

Here is another approach that uses R's built-in spatial processing capabilities:
library(sp)
library(rgeos)
library(rgdal)

# world places shapefile
URL1 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places.zip"
fil1 <- basename(URL1)
if (!file.exists(fil1)) download.file(URL1, fil1)
unzip(fil1)

places <- readOGR("ne_10m_populated_places.shp", "ne_10m_populated_places",
                  stringsAsFactors=FALSE)

# some data from the other answer since you didn't provide any
URL2 <- "http://simplemaps.com/resources/files/world/world_cities.csv"
fil2 <- basename(URL2)
if (!file.exists(fil2)) download.file(URL2, fil2)

# we need the points from said dat
dat <- read.csv(fil2, stringsAsFactors=FALSE)
pts <- SpatialPoints(dat[,c("lng", "lat")], CRS(proj4string(places)))

# this is not necessary
# I just don't like the warning about longlat not being a real projection
robin <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
pts <- spTransform(pts, robin)
places <- spTransform(places, robin)

# compute the distance (makes a pretty big matrix, so you should do this
# in chunks or row-by-row unless you have a ton of memory)
far <- gDistance(pts, places, byid=TRUE)

# find the closest one
closest <- apply(far, 1, which.min)

# map to the fields (you may want to map to other fields)
locs <- places@data[closest, c("NAME", "ADM1NAME", "ISO_A2")]

locs[sample(nrow(locs), 10),]

##              NAME        ADM1NAME ISO_A2
## 3274     Szczecin West Pomeranian     PL
## 1039     Balakhna      Nizhegorod     RU
## 1012       Chitre         Herrera     PA
## 3382     L'Aquila         Abruzzo     IT
## 1982       Dothan         Alabama     US
## 5159 Bayankhongor     Bayanhongor     MN
## 620        Deming      New Mexico     US
## 1907   Fort Smith        Arkansas     US
## 481      Dedougou        Mou Houn     BF
## 7169       Prague          Prague     CZ

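Not sure how to do this in R, but I think your code is synchronous, meaning only 1 HTTP request is sent at a time. If you were able to send 10 HTTP requests at once, you might get about a 5x speedup.

Your question has a built-in assumption that you need to make an API call for each of the 1.28m distinct locations, only because the SO solution you looked at did. But an offline lookup is better. You may want to edit your question to separate out that assumption.

Now that is an explanation! Thank you very much for such a detailed response.

Yes! Thank you very much.

Which distance function does distm use by default when no distance function is specified?

Haversine. You can specify it in the function, see the documentation.

I tried running your code in chunks of 5k rows, increasing the chunk size until R raised an out-of-memory error. So I process at most 20k rows at a time. Surprisingly, generating the matrix for 20k rows takes only 2 minutes, with a matrix size of 1.2 GB, so generating it for 1.2 million rows takes only about 2 hours (60 chunks of 2 minutes each). This saves a huge amount of time.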
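
The chunking described in the last comment could look roughly like the sketch below. It is an illustration under assumptions, not the commenter's actual code: `haversine.matrix` and `nearest.in.chunks` are hypothetical base-R helpers, the column names `lat` and `long` are assumed, and the figure of about 7,300 cities is a guess based on the row indices in the sample output above.

```r
# Haversine distance in km between every row of `a` and every row of `b`
# (both need `lat` and `long` columns in degrees).
haversine.matrix <- function(a, b, radius.km = 6371) {
  to.rad <- pi / 180
  dlat <- outer(a$lat * to.rad, b$lat * to.rad, function(x, y) y - x)
  dlon <- outer(a$long * to.rad, b$long * to.rad, function(x, y) y - x)
  h <- sin(dlat / 2)^2 +
    outer(cos(a$lat * to.rad), cos(b$lat * to.rad)) * sin(dlon / 2)^2
  h[h > 1] <- 1  # guard against floating-point overshoot
  2 * radius.km * asin(sqrt(h))
}

# Nearest-city lookup in chunks, so that only a chunk.size x nrow(cities)
# distance matrix is held in memory at any one time.
nearest.in.chunks <- function(points, cities, chunk.size = 20000) {
  if (nrow(points) == 0) return(integer(0))
  closest <- integer(nrow(points))
  for (s in seq(1, nrow(points), by = chunk.size)) {
    rows <- s:min(s + chunk.size - 1, nrow(points))
    d <- haversine.matrix(points[rows, , drop = FALSE], cities)
    closest[rows] <- apply(d, 1, which.min)
  }
  closest
}

# Memory for one chunk of doubles (8 bytes each), assuming ~7,300 cities.
chunk.gb <- 20000 * 7300 * 8 / 1e9

# Tiny illustrative run with a deliberately small chunk size.
cities <- data.frame(lat  = c(12.52, -16.33, 42.87),
                     long = c(-70.03, -48.95, 74.59))
points <- data.frame(lat  = c(42.9, 12.5, -16.3, 12.6, 43.0),
                     long = c(74.6, -70.0, -49.0, -70.1, 74.5))
idx <- nearest.in.chunks(points, cities, chunk.size = 2)
print(idx)
```

With these assumed sizes a 20k-row chunk needs a little under 1.2 GB, which is in line with the matrix size reported in the comment.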