Efficiently checking millions of image URLs in Python
I have a tsv file with over 3 million item rows. Each item has an id, a group, and a url, and the group column is sorted. I load it into a Python script and, before loading the items into a database, need to check that the URL of every item in a group returns status OK. I considered using processes and doing a URL check in each of them (I don't have much experience with this, so I'm not sure it's even a good idea).

My logic atm: fill array a1 with group gr1 -> pass each item in a1 to a new process -> that process checks for a 200 -> if OK, put the item into array a2 -> once every item in a1 has been checked, push a2 to the DB (along with some other things) -> repeat for the next group.

100k items take roughly 30 minutes. The bottleneck is the URL check; without it, the script is blazingly fast by comparison. What I have so far:
import csv
import re
import requests
import multiprocessing
from pymongo import MongoClient
import sys

# Load in data
f = open('../tsvsorttest.tsv', 'rb')
reader = csv.reader(f, delimiter='\n')

# Get the first group name
currGroup = re.split(r'\t', next(reader)[0].decode('utf8'))[1]
currGroupNum = 0
items = []
checkedItems = []

# Check the URL; if it returns 200, add the item to checkedItems
def check_url(newitem):
    if requests.get(newitem['item_url']).status_code == 200:
        print('got an ok!')
        checkedItems.append(newitem)
    global num_left
    num_left -= 1

def clear_img(checkitems):
    for thisItem in checkitems:
        p = multiprocessing.Process(target=check_url(thisItem))
        p.start()

# Start the loop, use i to keep track of the iteration count
for i, utf8_row in enumerate(reader):
    unicode_row = utf8_row[0].decode('utf8')
    x = re.split(r'\t', unicode_row)
    item = {"id": x[0],
            "group": x[1],
            "item_url": x[2]}
    if currGroup != x[1]:
        y = len(items)
        print('items length is ' + str(y))
        # Don't want single-item groups
        if y > 1:
            print 'beginning url checks'
            num_left = len(items)
            clear_img(items)
            while num_left != 0:
                print 'Waiting'
            num_left = 0
            batch = {"vran": currGroup,
                     "bnum": currGroupNum,
                     "items": checkedItems}
            if len(checkedItems) > 0:
                # 'batches' is a pymongo collection; its setup is elided here
                batches.insert_one(batch)
            currGroupNum += 1
        currGroup = x[1]
        items = []
        checkedItems = []
    items.append(item)
    if i % 100 == 0:
        print "Milestone: " + str(i)
print "done"
Additional consideration: would splitting the original tsv into 30 separate tsv files and running the batch script on all 30 in parallel make a difference?
It was already mentioned that you should try HEAD instead of GET; that avoids downloading the images. Also, you seem to be spawning a separate process for each request, which is inefficient as well.
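For illustration, a minimal sketch of such a check (the url_ok helper name and the 405 fallback are my own additions, not from the original answer; some servers reject HEAD, and a streamed GET that is closed before the body is read avoids the download in that case):

import requests

def url_ok(url, timeout=5):
    """Return True if the URL answers 200, without downloading the body."""
    try:
        # HEAD transfers headers only -- no image bytes over the wire.
        r = requests.head(url, timeout=timeout, allow_redirects=True)
        if r.status_code == 405:  # server does not allow HEAD
            # Fall back to a streamed GET and close it before reading the body.
            r = requests.get(url, timeout=timeout, stream=True)
            r.close()
        return r.status_code == 200
    except requests.RequestException:
        return False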
Performance-wise, I don't think asyncio is really needed here. A solution using a plain thread pool (not even a process pool) is easier to grasp, IMHO :) and it is also available in Python 2.7.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
from collections import defaultdict

def read_rows(file):
    with open(file) as f_in:
        return [row for row in csv.reader(f_in, delimiter='\t')]

def check_url(inp):
    """Gets called by workers in thread pool. Checks for existence of URL."""
    id, grp, url = inp
    def chk():
        try:
            return requests.head(url).status_code == 200
        except IOError as e:
            return False
    return (id, grp, url, chk())

if __name__ == '__main__':
    d = defaultdict(lambda: [])
    with ThreadPoolExecutor(max_workers=20) as executor:
        future_to_input = {executor.submit(check_url, inp): inp for inp in read_rows('urls.txt')}
        for future in as_completed(future_to_input):
            id, grp, url, res = future.result()
            d[grp].append((id, url, res))
    # do something with your d (e.g. sort appropriately, filter those with len(d[grp]) <= 1, ...)
    for g, bs in d.items():
        print(g)
        for id, url, res in bs:
            print(" %s %5s %s" % (id, res, url))
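Note that concurrent.futures ships with the standard library from Python 3.2 on; under Python 2.7 it is available via the futures backport (pip install futures).

Since millions of requests go out, often to the same hosts, reusing HTTP connections should help as well. Below is a possible drop-in variant of check_url that gives each worker thread its own requests.Session (the thread-local handling is my own sketch, not part of the original answer):

import threading
import requests

thread_local = threading.local()

def get_session():
    # One Session per worker thread: a Session keeps a connection pool
    # alive, so repeated requests to the same host can skip the TCP
    # handshake, and per-thread instances sidestep thread-safety concerns.
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def check_url(inp):
    id, grp, url = inp
    try:
        ok = get_session().head(url, timeout=5, allow_redirects=True).status_code == 200
    except requests.RequestException:
        ok = False
    return (id, grp, url, ok)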
Comments on the question:

If the images are requested from a "normal" web server, you can do a HEAD instead of a GET request. — Ah yes, that should help; I'll give it a try.

The asynchronous nature of getting responses from a web server is not a great fit for the multiprocessing library, whose goal is distributing tasks across CPU cores. You probably want to increase the size of your worker pool substantially, to allow for all the I/O-bound blocking.

By the way, for loops in Python are rather slow. If efficiency matters, I'd suggest implementing it in C/C++ and wrapping it in Python, though it depends on the level.
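Following up on that pool-size comment: the workers here spend almost all their time blocked on network I/O, so the pool can safely be much larger than the CPU count. Raising max_workers in the answer's ThreadPoolExecutor is a one-line change; the right value depends on bandwidth and on how many concurrent requests the target servers tolerate, so it is best tuned empirically.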