How do I scrape multiple HTML pages in parallel with BeautifulSoup in Python?

I am building a web scraping application with Python and the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snippet of the code I have written:

import requests
from bs4 import BeautifulSoup

# each requests.get() blocks, so the pages are fetched one at a time
for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class": "dev-link"})
Right now the pages are scraped sequentially, and I would like to run this in parallel. I don't know much about threading in Python. Can someone tell me how I can do the scraping in parallel? Any help would be appreciated.

Try this solution:

import threading

import requests
from bs4 import BeautifulSoup

results = []

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # a thread target's return value is discarded, so append instead
    results.append(soup.find_all("a", {"class": "dev-link"}))

threads = [threading.Thread(target=fetch_links, args=(url,))
           for url in websites]

for t in threads:   # note: "threads", not "thread"
    t.start()
for t in threads:   # wait for all downloads to finish
    t.join()

Since downloading page content with requests.get() is a blocking operation, Python threads can genuinely improve performance here: the GIL is released while a thread is waiting on network I/O.
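
As a tidier variant (my own sketch, not part of the original answer), concurrent.futures gives you the result collection that bare threading.Thread lacks; fetch_links here is the same hypothetical helper as above:

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})

# map() preserves the order of `websites` and hands back each call's
# return value, which plain threading.Thread cannot do
with ThreadPoolExecutor(max_workers=8) as executor:
    all_links = list(executor.map(fetch_links, websites))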

If you want to use multithreading:

import threading
import requests
from bs4 import BeautifulSoup

class Scraper(threading.Thread):
    def __init__(self, threadId, name, url):
        threading.Thread.__init__(self)
        self.name = name
        self.id = threadId
        self.url = url
        self.links = []  # run() cannot return a value, so store results here

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        self.links = soup.find_all("a")

# list the websites in the list below
websites = []

# start() runs each scraper concurrently; calling run() directly would
# just execute them one by one in the main thread
threads = [Scraper(i, "thread" + str(i), url)
           for i, url in enumerate(websites, start=1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
    # print(t.links)

This might help: Scrapy could be the way to go for this in Python.

Scrapy uses the Twisted library for its parallelism, so you don't have to worry about managing threads yourself.

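As a rough illustration (a sketch I'm adding, with a made-up spider name and item shape), a minimal Scrapy spider for the same task could look like this:

import scrapy

class DevLinkSpider(scrapy.Spider):
    name = "dev_links"  # hypothetical spider name
    start_urls = []     # put the websites list here

    def parse(self, response):
        # Scrapy downloads all start_urls concurrently and calls
        # parse() with each response as it arrives
        for href in response.css("a.dev-link::attr(href)").getall():
            yield {"link": href}

Run it with scrapy runspider myspider.py -o links.json and Scrapy handles the scheduling, concurrency limits, and retries for you.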

If you have to use BeautifulSoup, take a look at driving it from a pool of workers.
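
For example (my sketch, not a link from the original answer), a process pool parallelizes both the downloads and the parsing; note that BeautifulSoup Tag objects don't pickle cleanly across processes, so the worker returns plain strings:

from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # return plain strings: Tag objects are not reliably picklable
    return [a.get("href") for a in soup.find_all("a", {"class": "dev-link"})]

if __name__ == "__main__":
    websites = []  # fill in the urls to scrape
    with Pool(processes=4) as pool:
        all_links = pool.map(fetch_links, websites)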

How many web pages are you scraping at a time?