Efficiently checking millions of image URLs in Python
I have a tsv file with over 3 million item rows. Each item has an id, a group, and a url, and the group column is sorted. I load it into a Python script and, before loading the items into a database, need to check that the URL of every item in a group returns status OK. I considered using processes and doing a URL check in each of them (I don't have much experience with this, so I'm not sure it's even a good idea).

My logic atm: fill array a1 with group gr1 -> pass each item in a1 to a new process -> that process checks for a 200 -> if OK, put the item into array a2 -> once every item in a1 has been checked, push a2 to the DB (along with some other things) -> repeat for the next group.

100k items take roughly 30 minutes. The bottleneck is the URL check; without it, the script is blazingly fast by comparison. What I have so far:
import csv
import re
import requests
import multiprocessing
from pymongo import MongoClient
import sys

# Load in data
f = open('../tsvsorttest.tsv', 'rb')
reader = csv.reader(f, delimiter='\n')

# Get the first group name
currGroup = re.split(r'\t', next(reader)[0].decode('utf8'))[1]
currGroupNum = 0
items = []
checkedItems = []

# Check the URL; if it returns 200, add the item to checkedItems
def check_url(newitem):
    if requests.get(newitem['item_url']).status_code == 200:
        print('got an ok!')
        checkedItems.append(newitem)
    global num_left
    num_left -= 1

def clear_img(checkitems):
    for thisItem in checkitems:
        p = multiprocessing.Process(target=check_url(thisItem))
        p.start()

# Start the loop, use i to keep track of the iteration count
for i, utf8_row in enumerate(reader):
    unicode_row = utf8_row[0].decode('utf8')
    x = re.split(r'\t', unicode_row)
    item = {"id": x[0],
            "group": x[1],
            "item_url": x[2]}
    if currGroup != x[1]:
        y = len(items)
        print('items length is ' + str(y))
        # Don't want single-item groups
        if y > 1:
            print 'beginning url checks'
            num_left = len(items)
            clear_img(items)
            while num_left != 0:
                print 'Waiting'
            num_left = 0
            batch = {"vran": currGroup,
                     "bnum": currGroupNum,
                     "items": checkedItems}
            if len(checkedItems) > 0:
                # 'batches' is a pymongo collection; its setup is elided here
                batches.insert_one(batch)
            currGroupNum += 1
        currGroup = x[1]
        items = []
        checkedItems = []
    items.append(item)
    if i % 100 == 0:
        print "Milestone: " + str(i)
print "done"
Additional consideration: would splitting the original tsv into 30 separate tsv files and running the batch script on all 30 in parallel make a difference?
It was already mentioned that you should try HEAD instead of GET; that avoids downloading the images. Also, you seem to be spawning a separate process for each request, which is inefficient as well.
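For illustration, a minimal sketch of such a check (the url_ok helper name and the 405 fallback are my own additions, not from the original answer; some servers reject HEAD, and a streamed GET that is closed before the body is read avoids the download in that case):

import requests

def url_ok(url, timeout=5):
    """Return True if the URL answers 200, without downloading the body."""
    try:
        # HEAD transfers headers only -- no image bytes over the wire.
        r = requests.head(url, timeout=timeout, allow_redirects=True)
        if r.status_code == 405:  # server does not allow HEAD
            # Fall back to a streamed GET and close it before reading the body.
            r = requests.get(url, timeout=timeout, stream=True)
            r.close()
        return r.status_code == 200
    except requests.RequestException:
        return False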
Performance-wise, I don't think asyncio is really needed here. A solution using a plain thread pool (not even a process pool) is easier to grasp, IMHO :) and it is also available in Python 2.7.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
from collections import defaultdict

def read_rows(file):
    with open(file) as f_in:
        return [row for row in csv.reader(f_in, delimiter='\t')]

def check_url(inp):
    """Gets called by workers in thread pool. Checks for existence of URL."""
    id, grp, url = inp
    def chk():
        try:
            return requests.head(url).status_code == 200
        except IOError as e:
            return False
    return (id, grp, url, chk())

if __name__ == '__main__':
    d = defaultdict(lambda: [])
    with ThreadPoolExecutor(max_workers=20) as executor:
        future_to_input = {executor.submit(check_url, inp): inp for inp in read_rows('urls.txt')}
        for future in as_completed(future_to_input):
            id, grp, url, res = future.result()
            d[grp].append((id, url, res))
    # do something with your d (e.g. sort appropriately, filter those with len(d[grp]) <= 1, ...)
    for g, bs in d.items():
        print(g)
        for id, url, res in bs:
            print(" %s %5s %s" % (id, res, url))
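Note that concurrent.futures ships with the standard library from Python 3.2 on; under Python 2.7 it is available via the futures backport (pip install futures).

Since millions of requests go out, often to the same hosts, reusing HTTP connections should help as well. Below is a possible drop-in variant of check_url that gives each worker thread its own requests.Session (the thread-local handling is my own sketch, not part of the original answer):

import threading
import requests

thread_local = threading.local()

def get_session():
    # One Session per worker thread: a Session keeps a connection pool
    # alive, so repeated requests to the same host can skip the TCP
    # handshake, and per-thread instances sidestep thread-safety concerns.
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def check_url(inp):
    id, grp, url = inp
    try:
        ok = get_session().head(url, timeout=5, allow_redirects=True).status_code == 200
    except requests.RequestException:
        ok = False
    return (id, grp, url, ok)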
Comments on the question:

If the images are requested from a "normal" web server, you can do a HEAD instead of a GET request. — Ah yes, that should help; I'll give it a try.

The asynchronous nature of getting responses from a web server is not a great fit for the multiprocessing library, whose goal is distributing tasks across CPU cores. You probably want to increase the size of your worker pool substantially, to allow for all the I/O-bound blocking.

By the way, for loops in Python are rather slow. If efficiency matters, I'd suggest implementing it in C/C++ and wrapping it in Python, though it depends on the level.
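Following up on that pool-size comment: the workers here spend almost all their time blocked on network I/O, so the pool can safely be much larger than the CPU count. Raising max_workers in the answer's ThreadPoolExecutor is a one-line change; the right value depends on bandwidth and on how many concurrent requests the target servers tolerate, so it is best tuned empirically.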