Python: pandas DataFrames and multiprocessing
I am using the FCC API to convert latitude/longitude coordinates into block group codes:
import pandas as pd
import numpy as np
import urllib.request
import time
import json
# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
new_list = []
def block(x):
    for index, row in x.iterrows():
        #request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        #load json output into a form python can understand
        a1 = json.loads(a)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
#call the function with latlong as the argument.
block(latlong)
#print the list, note: it is important that function appends to the list
print(new_list)
which gives this output:
['360610031001021', '060372074001033', '170318391001104', '482011000003087',
'421010005001010', '040131141001032', '480291101002041', '060730053003011',
'481130204003064', '060855010004004', '484530011001092', '180973910003057',
'120310010001023', '060750201001001', '390490040001005', '371190001005000',
'484391233002071', '261635172001069', '481410029001001', '471570042001018']
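The problem with this script is that I can only call the API once per row. It takes roughly 5 minutes per thousand rows to run, which is not acceptable for the 1,000,000+ entries I plan to use it on.
I would like to use multiprocessing to parallelize the function and cut its running time. I have tried reading the multiprocessing documentation, but I could not figure out how to run the function in parallel and append the output to an empty list.
FYI: I am using Python 3.6.
Any guidance would be great.

You don't have to implement the parallelism yourself. There are libraries that are nicer than urllib, such as requests [0], and some spin-offs [1] that use threads or futures. I guess you will need to check for yourself which one is the fastest.
Because it has few dependencies, I like requests-futures best; below is my implementation of your code using ten threads. The library even supports processes, if you believe or find out that they are somehow better in your case: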
import pandas as pd
import numpy as np
import urllib
import time
import json
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='
getup1 = '&longitude='
getup2 = '&showall=false'
lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
'33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
'39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
'32.7554883','42.331427','31.7775757','35.1495343']
long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
'-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
'-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
'-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']
#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']
def block(x):
    requests = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        #request url and read the output
        url = getup+row['lat']+getup1+row['long']+getup2
        requests.append(session.get(url))
    new_list = []
    for request in requests:
        #load json output into a form python can understand
        a1 = json.loads(request.result().content)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])
    return new_list
#call the function with latlong as the argument.
new_list = block(latlong)
#print the list returned by block()
print(new_list)
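If you would rather not add a dependency, roughly the same thing can be done with the standard library alone. This is only a sketch: it assumes the endpoint and the JSON shape shown in the question, and the names `API_URL`, `fetch_fips` and `block_parallel` are made up for illustration. `executor.map` returns the results in input order, so nothing has to be appended to a shared list.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same endpoint as in the question (assumed to still respond in this form).
API_URL = 'http://data.fcc.gov/api/block/find'


def fetch_fips(coords):
    """Look up one (lat, long) pair and return its block FIPS code."""
    lat, long = coords
    query = urllib.parse.urlencode(
        {'format': 'json', 'latitude': lat, 'longitude': long, 'showall': 'false'})
    with urllib.request.urlopen(API_URL + '?' + query) as response:
        return json.loads(response.read())['Block']['FIPS']


def block_parallel(pairs, fetch=fetch_fips, max_workers=10):
    """Run fetch over all pairs in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, pairs))
```

With the DataFrame from the question it would be called as `new_list = block_parallel(zip(latlong['lat'], latlong['long']))`. The `fetch` parameter is only there so the function can be tried with a stand-in that does not hit the network.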
[0]
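[1]
Hey, you might want to take a look. Most of the time, using parallelism in Python increases computation time rather than reducing it.
Since this is IO-bound, threading makes sense here, though you will have to restructure the problem to avoid appending to a global list. The docs here are a good starting point.
@Tbaki multiprocessing is not affected by the GIL. In fact, it was created to provide a threading-like API for creating multiple processes, thereby sidestepping the limitations of the GIL. And as @chrisb pointed out, since this code is IO-bound, threads won't be limited by the GIL either.
@juanpa.arrivillaga thanks for the info!
This worked! I went from 5 minutes per thousand to 1 minute per thousand.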
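To see why threads help with IO-bound work despite the GIL, here is a small timing sketch. It does not call the FCC API: `fake_request` is a hypothetical stand-in that just sleeps, which releases the GIL much like waiting on a network response does. The exact timings will vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i, delay=0.05):
    """Stand-in for one API call: the sleep releases the GIL while waiting."""
    time.sleep(delay)
    return i


def timed(func):
    """Call func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def run_sequential(n):
    # One call after another, like the loop in the question.
    return [fake_request(i) for i in range(n)]


def run_threaded(n, max_workers=10):
    # Up to max_workers calls wait at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_request, range(n)))


seq_result, seq_time = timed(lambda: run_sequential(10))
thr_result, thr_time = timed(lambda: run_threaded(10))
print('sequential: %.2fs, threaded: %.2fs' % (seq_time, thr_time))
```

The sequential run has to sit through every wait in turn, while the threaded run overlaps them, so it should finish in a fraction of the time. For CPU-bound work the picture is different, and that is where processes come in.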