Python 通过网络IO和数据库查询改进多线程处理_Python_Multithreading_Multiprocessing_Mysql Python_Pycurl

Python 通过网络IO和数据库查询改进多线程处理

python multithreading

Python 通过网络IO和数据库查询改进多线程处理,python,multithreading,multiprocessing,mysql-python,pycurl,Python,Multithreading,Multiprocessing,Mysql Python,Pycurl,我正在写一个脚本：从数据库中获取URL列表（约10000个URL）下载所有页面并将其插入数据库解析代码如果（某些条件）在数据库中进行其他插入我有一个带超线程的Xeon四核，所以总共有8个线程可用，我在Linux（64位）下我使用cStringIO作为缓冲区，pycurl获取页面，BeautifulSoup解析页面，MySQLdb与数据库交互我试图简化下面的代码（删除所有try/except、解析操作等）我使用了多线程处理，因此，如果某些请求因超时而失败，我不需要等待。该特定线程将

我正在写一个脚本：

从数据库中获取URL列表（约10000个URL）

下载所有页面并将其插入数据库

解析代码

如果（某些条件）在数据库中进行其他插入

我有一个带超线程的Xeon四核，所以总共有8个线程可用，我在Linux（64位）下

我使用

cStringIO

作为缓冲区，

pycurl

获取页面，

BeautifulSoup

解析页面，

MySQLdb

与数据库交互

我试图简化下面的代码（删除所有try/except、解析操作等）

我使用了

多线程处理

，因此，如果某些请求因超时而失败，我不需要等待。该特定线程将等待，但其他线程可以继续使用其他URL

除了脚本什么也不做的屏幕截图。似乎有5个内核正忙，而另一个则不忙。因此，问题是：

我应该创建与线程数量相同的游标吗
我真的需要锁定查询的执行吗？如果一个线程执行cur.execute（）而不是db.commit（），而另一个线程执行另一个查询并提交，会发生什么情况
我读过关于这个类的文章，但我不确定我是否理解正确：我可以用它来代替锁定+提取url+释放吗
使用
```
多线程
```
是否会遇到I/O（网络）瓶颈？使用100个线程，我的速度不会超过~500Kb/s，而我的连接速度会更快。如果我转到
```
多进程
```
，我会看到这方面的一些改进吗
同样的问题，但MySQL：使用我的代码，这方面可能会有瓶颈？所有这些锁+插入查询+释放都可以通过某种方式改进吗
如果方法是多线程，那么100是大量线程吗？我的意思是，太多执行I/O请求（或DB查询）的线程是无用的，因为这些操作相互排斥？或者更多线程意味着更高的网络速度

可能更适合此类问题。值得尝试多重处理。我相信你唯一需要做的就是将

threading.Thread

替换为

multiprocessing.Process

来测试它。关于@COpython我已经试过了，CPU和网络使用率似乎都在增加，但脚本正在循环，所以我不知道性能更好的原因。

import cStringIO, threading, MySQLdb.cursors, pycurl

NUM_THREADS = 100
lock_list = threading.Lock()
lock_query = threading.Lock()


db = MySQLdb.connect(host = "...", user = "...", passwd = "...", db = "...", cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute("SELECT...")
rows = cur.fetchall()
rows = [x for x in rows]  # convert to a list so it's editable


class MyThread(threading.Thread):
    def run(self):
        """ initialize a StringIO object and a pycurl object """

        while True:
            lock_list.acquire()  # acquire the lock to extract a url
            if not rows:  # list is empty, no more url to process
                lock_list.release()
                break
            row = rows.pop()
            lock_list.release()

            """ download the page with pycurl and do some check """

            """ WARNING: possible bottleneck if all the pycurl
                connections are waiting for the timeout """

            lock_query.acquire()
            cur.execute("INSERT INTO ...")  # insert the full page into the database
            db.commit()
            lock_query.release()

            """do some parse with BeautifulSoup using the StringIO object"""

            if something is not None:
                lock_query.acquire()
                cur.execute("INSERT INTO ...")  # insert the result of parsing into the database
                db.commit()
                lock_query.release()


# create and start all the threads
threads = []
for i in range(NUM_THREADS):
    t = MyThread()
    t.start()
    threads.append(t)

# wait for threads to finish
for t in threads:
    t.join()