Python: how to use threads to handle multiple GET requests and compare the results

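I have been trying to figure out how to speed this up, and to learn something about threading along the way.

I am writing a function that makes two GET requests. For each link I scrape some data and save it into a list that is returned. I then use that list for a comparison, to see whether a new link has appeared on one of the pages: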

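import operator
import random
import time
from functools import reduce

import requests
from bs4 import BeautifulSoup as soup


# "self" is dropped: both functions are called as plain functions below.
def getScrapeLinks(siteURL):
    # Fetch one page and collect the href of the first <a> in each <div class="test">.
    response = requests.get(siteURL, timeout=5)

    if not response.ok:
        # Return an empty list so that reduce(operator.add, ...) still works.
        return []

    page = soup(response.text, 'lxml')

    return [raw_product.find('a').get('href')
            for raw_product in page.find_all('div', {'class': 'test'})]


def pollNewProducts(storeClass):

    # storeClass.siteCatalog = ["https://www.google.com", "https://www.facebook.com"]

    # Baseline: the requests run one after another, which is the slow part.
    LinksLists = reduce(operator.add, [getScrapeLinks(getLinks) for getLinks in storeClass.siteCatalog])

    while True:

        newLinksLists = reduce(operator.add,
                               [getScrapeLinks(getLinks) for getLinks in storeClass.siteCatalog]
                               )

        for URL in newLinksLists:
            if URL not in LinksLists:
                print("New link")
                print(URL)
                LinksLists.append(URL)
        else:
            # The loop never breaks, so this runs after every pass.
            print("Sleep to see new links!")
            time.sleep(random.randint(2, 4))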
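My problem is that with reduce the requests run one after another: it first makes the request to, for example, Google and collects the data, and only when that is finished does it make the second request, to Facebook. What I want is to speed this up by giving each link its own thread, so that they run at the same time instead of waiting on each other.

How can I run each link separately and still compare the results, so that I pick up any new URL that appears in a GET response?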

Adapted from my answer to a related question:


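You should look into asynchronous programming. Unlike threaded code, asynchronous code runs in a single thread, inside an event loop. Whenever the Python keyword await is reached, the event loop can switch context to a different operation.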
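To make the switching concrete, here is a minimal sketch that uses only the standard library. It assumes asyncio.sleep as a stand-in for the time spent waiting on a server; fake_get and run_all are hypothetical names chosen for illustration, not part of the original answer.

```python
import asyncio


async def fake_get(name, delay, log):
    # asyncio.sleep simulates waiting for a server reply.
    log.append(f"start {name}")
    await asyncio.sleep(delay)  # control goes back to the event loop here
    log.append(f"done {name}")
    return name


async def run_all():
    log = []
    # Both coroutines are started before either one finishes waiting.
    results = await asyncio.gather(
        fake_get("a", 0.2, log),
        fake_get("b", 0.1, log),
    )
    return results, log
```

Running asyncio.run(run_all()) should show "b" finishing before "a" in the log even though "a" started first, while the results list keeps the order in which the coroutines were passed to gather.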

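In other words, you can picture scraping websites like this, with the client switching to another operation while each reply is pending: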

client sends request -> ... waiting for server reply ... <- server replies
client sends request -> switch operation -> ... wait ... <- server replies
client sends request -> switch operation -> ... wait ... <- server replies
client sends request -> switch operation -> ... wait ... <- server replies
...
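The fragments quoted in the comments further down (asyncio.gather over get(url), and resp = await response.read()) suggest a program along the following lines. This is a minimal sketch and not the answer's original code: the split into get(session, url), main and timed_run, the shared session, and the lazy import are all assumptions made for illustration. It requires the third-party aiohttp package.

```python
import asyncio
import time


async def get(session, url):
    # `session` is assumed to be an aiohttp.ClientSession, or any object whose
    # get(url) returns an async context manager with an async read() method.
    try:
        async with session.get(url) as response:
            resp = await response.read()  # the event loop may switch tasks here
        print(f"Successfully got url {url} with response of length {len(resp)}.")
        return len(resp)
    except Exception as exc:
        print(f"Unable to get url {url} due to {exc.__class__.__name__}.")
        return None


async def main(urls):
    import aiohttp  # third-party; imported lazily so get() has no hard dependency

    async with aiohttp.ClientSession() as session:
        # Start every request, then wait until all of them have finished.
        ret = await asyncio.gather(*[get(session, url) for url in urls])
    print(f"Finalized all. ret is a list of len {len(ret)} outputs.")
    return ret


def timed_run(urls):
    # Convenience wrapper: run the event loop and report the elapsed time.
    start = time.time()
    ret = asyncio.run(main(urls))
    print(f"Took {time.time() - start} seconds to pull {len(urls)} websites.")
    return ret
```

A call such as timed_run(["https://www.example.com", "https://www.wikipedia.org"]) would fetch both pages in one event loop; failed requests are reported and returned as None rather than aborting the whole batch.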
Output of a run that requested 100 URLs this way with aiohttp:

Successfully got url http://www.google.com.br with response of length 12188.
Successfully got url http://www.google.it with response of length 12155.
Successfully got url https://www.t.co with response of length 0.
Successfully got url http://www.msn.com with response of length 46335.
Successfully got url http://www.chinadaily.com.cn with response of length 122053.
Successfully got url https://www.google.co.in with response of length 11557.
Successfully got url https://www.google.de with response of length 12135.
Successfully got url https://www.facebook.com with response of length 115258.
Successfully got url http://www.gmw.cn with response of length 120866.
Successfully got url https://www.google.co.uk with response of length 11540.
Successfully got url https://www.google.fr with response of length 12189.
Successfully got url http://www.google.es with response of length 12163.
Successfully got url http://www.google.co.id with response of length 12169.
Successfully got url https://www.bing.com with response of length 117915.
Successfully got url https://www.instagram.com with response of length 36307.
Successfully got url https://www.google.ru with response of length 12128.
Successfully got url http://www.googleusercontent.com with response of length 1561.
Successfully got url http://www.xinhuanet.com with response of length 179254.
Successfully got url http://www.google.ca with response of length 11592.
Successfully got url http://www.accuweather.com with response of length 269.
Successfully got url http://www.googleadservices.com with response of length 1561.
Successfully got url https://www.whatsapp.com with response of length 77951.
Successfully got url http://www.cntv.cn with response of length 3139.
Successfully got url http://www.google.com.au with response of length 11579.
Successfully got url https://www.example.com with response of length 1270.
Successfully got url http://www.google.co.th with response of length 12151.
Successfully got url https://www.amazon.com with response of length 465905.
Successfully got url https://www.wikipedia.org with response of length 76240.
Successfully got url https://www.google.co.kr with response of length 12211.
Successfully got url https://www.apple.com with response of length 63322.
Successfully got url http://www.uol.com.br with response of length 333257.
Successfully got url https://www.aliexpress.com with response of length 59742.
Successfully got url http://www.sohu.com with response of length 215201.
Successfully got url https://www.google.pl with response of length 12144.
Successfully got url https://www.googleweblight.com with response of length 0.
Successfully got url https://www.cnn.com with response of length 1138392.
Successfully got url https://www.google.com.ph with response of length 11561.
Successfully got url https://www.linkedin.com with response of length 71498.
Successfully got url https://www.naver.com with response of length 176038.
Successfully got url https://www.live.com with response of length 3667.
Successfully got url https://www.twitch.tv with response of length 61599.
Successfully got url http://www.163.com with response of length 696338.
Successfully got url https://www.ebay.com with response of length 307068.
Successfully got url https://www.wordpress.com with response of length 76680.
Successfully got url https://www.wikia.com with response of length 291400.
Successfully got url http://www.chrome.com with response of length 161223.
Successfully got url https://www.twitter.com with response of length 291741.
Successfully got url https://www.stackoverflow.com with response of length 105987.
Successfully got url https://www.netflix.com with response of length 83125.
Successfully got url https://www.tumblr.com with response of length 78110.
Successfully got url http://www.doubleclick.net with response of length 129901.
Successfully got url https://www.yahoo.com with response of length 531829.
Successfully got url http://www.soso.com with response of length 174.
Successfully got url https://www.microsoft.com with response of length 187549.
Successfully got url http://www.office.com with response of length 89556.
Successfully got url http://www.alibaba.com with response of length 167978.
Successfully got url https://www.reddit.com with response of length 483295.
Successfully got url http://www.outbrain.com with response of length 24432.
Successfully got url http://www.tianya.cn with response of length 7941.
Successfully got url https://www.baidu.com with response of length 156768.
Successfully got url http://www.diply.com with response of length 3074314.
Successfully got url http://www.blogspot.com with response of length 94478.
Successfully got url http://www.popads.net with response of length 14548.
Successfully got url http://www.answers.yahoo.com with response of length 104726.
Successfully got url http://www.blogger.com with response of length 94478.
Successfully got url http://www.imgur.com with response of length 4008.
Successfully got url http://www.qq.com with response of length 244841.
Successfully got url http://www.paypal.com with response of length 45587.
Successfully got url http://www.pinterest.com with response of length 45692.
Successfully got url http://www.github.com with response of length 86917.
Successfully got url http://www.zhihu.com with response of length 31473.
Successfully got url http://www.go.com with response of length 594291.
Successfully got url http://www.fc2.com with response of length 34546.
Successfully got url https://www.amazon.de with response of length 439209.
Successfully got url https://www.youtube.com with response of length 439571.
Successfully got url http://www.bbc.co.uk with response of length 321966.
Successfully got url http://www.tmall.com with response of length 234388.
Successfully got url http://www.imdb.com with response of length 289339.
Successfully got url http://www.dropbox.com with response of length 103714.
Successfully got url http://www.bilibili.com with response of length 50959.
Successfully got url http://www.jd.com with response of length 18105.
Successfully got url http://www.yahoo.co.jp with response of length 18565.
Successfully got url https://www.amazon.co.jp with response of length 479721.
Successfully got url http://www.craigslist.org with response of length 59372.
Successfully got url https://www.360.cn with response of length 74502.
Successfully got url http://www.ok.ru with response of length 170516.
Successfully got url https://www.amazon.in with response of length 460696.
Successfully got url http://www.booking.com with response of length 408992.
Successfully got url http://www.yandex.ru with response of length 116661.
Successfully got url http://www.nicovideo.jp with response of length 107271.
Successfully got url http://www.onet.pl with response of length 720657.
Successfully got url http://www.alipay.com with response of length 21698.
Successfully got url https://www.amazon.co.uk with response of length 443607.
Successfully got url http://www.sina.com.cn with response of length 579107.
Successfully got url http://www.hao123.com with response of length 295213.
Successfully got url http://www.pixnet.net with response of length 6295.
Successfully got url http://www.coccoc.com with response of length 45822.
Successfully got url http://www.taobao.com with response of length 393128.
Successfully got url http://www.weibo.com with response of length 95482.
Successfully got url http://www.youku.com with response of length 762485.
Finalized all. ret is a list of len 100 outputs.
Took 3.899034023284912 seconds to pull 100 websites.
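As you can see, 100 websites from around the world (with or without https) were reached successfully in about 4 seconds using aiohttp on my internet connection (Miami, Florida). Keep in mind that the following can slow the program down by a few milliseconds:

  • print statements (yes, including the ones in the code above)
  • reaching servers that are farther away from your geographic location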

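The example above does both of these things, so it is arguably the least optimized way of doing what you asked. Still, I believe it is a good starting point.

Comments:

I would suggest using asynchronous programming rather than threads.

Oh, do you have an example of what that looks like, and why I should use it? @Felipe

Yes, of course, one moment.

With await asyncio.gather(*[get(url) for url in urls]) we call get() for every URL. When a get() reaches resp = await response.read(), we switch to a different context. Essentially, when we hit the await keyword we go back to the event loop and ask, "OK, what else can I do so that I am not idle?" The answer is to carry on with more get() calls. Once the last get() has been started, the loop goes back to the first get() and asks, "Great, have you finished fetching?" Most of the time it will be waiting on at least one response. To answer more directly: asynchronous code does not run simultaneously. The whole point of asynchronous code is to let you start operations that involve an external wait one after another, without waiting for each wait to end before starting the next. That might be reading millions of files from disk (disk reads have latency), sending millions of requests to the internet (servers take time to respond), and so on. Let me know if that makes sense.

It looks like you have got it. :) I would recommend getting familiar with asyncio. It is a very powerful module, if I may say so myself.

Oh yes, I think I need to read more about async and apply it so that it fits my needs completely :D

Exactly. Once you get the hang of it, things become a lot more fun. Good luck!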
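To connect this back to the original question (fetch every catalog page concurrently and still detect new links), here is a minimal sketch of the polling loop in asyncio form. It assumes a hypothetical coroutine fetch_links(url) that returns the list of links on one page, for example the request plus the BeautifulSoup parsing from the question; poll_once, poll_forever and the delay parameter are likewise illustrative names and not from the thread.

```python
import asyncio


async def poll_once(fetch_links, catalog, seen):
    # Fetch all catalog pages concurrently; gather keeps the catalog order.
    batches = await asyncio.gather(*[fetch_links(url) for url in catalog])
    new_links = []
    for links in batches:
        for link in links:
            if link not in seen:
                seen.add(link)  # remember it so it is reported only once
                new_links.append(link)
    return new_links


async def poll_forever(fetch_links, catalog, delay=3.0):
    seen = set()
    # The first pass only builds the baseline of already known links.
    await poll_once(fetch_links, catalog, seen)
    while True:
        for link in await poll_once(fetch_links, catalog, seen):
            print("New link")
            print(link)
        await asyncio.sleep(delay)  # non-blocking pause between polls
```

Using a set for seen makes each membership check constant time on average, whereas the list in the question is scanned from the start for every URL.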