Can't read/write a file using multithreading in Python
I have an input file that contains a long list of URLs. Let's assume it is in mylines.txt:
https://yahoo.com
https://google.com
https://facebook.com
https://twitter.com
What I need to do is:

1. Read a URL from the input file mylines.txt.
2. Run the myFun function on it. It performs some task and produces a single line of output. In my real code it is more complex, but conceptually that's it.
3. Write the output to the results.txt file.

Since I have a large number of inputs, I need to use Python multithreading so the requests to each URL run in parallel and I can get through the list in a reasonable time.
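For reference, the three steps above can be sketched as a simple serial version first; process_url here is only a placeholder standing in for the real myFun, which would do the network work:

```python
# Serial sketch of the three steps: read URLs, process each one,
# write one output line per URL. process_url is a placeholder for
# the real myFun, which would call out over the network.
def process_url(url):
    return "url is:" + url + ", processed"

def run_serial(in_path, out_path):
    with open(in_path) as f, open(out_path, "w") as out:
        for line in f:
            url = line.strip()
            if url:
                out.write(process_url(url) + "\n")
```

The threaded versions below keep this same read/process/write shape and only parallelize the middle step.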
Update:

Based on the answer, the code became:
import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import queue
from multiprocessing import Queue

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

worker_data = open("mylines.txt", "r")  # open my input file

# load up a queue with your data, this will handle locking
q = queue.Queue(4)
with open("mylines.txt", "r") as f:  # open my input file
    for url in f:
        q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

with open("myresults", "w") as f:
    for line in results:
        f.write(line + '\n')
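One likely problem with this version: pool.map expects an iterable of items, and a queue.Queue is not iterable, so passing q to it raises a TypeError. A minimal working sketch passes a plain list instead (myFunc here is a stand-in that skips the real network call):

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

def myFunc(url):
    # stand-in for the real function, which would call requests.get
    return "url is:" + url

# normally read from mylines.txt; a literal list keeps the sketch self-contained
urls = ["https://yahoo.com", "https://google.com"]

with ThreadPool(4) as pool:
    results = pool.map(myFunc, urls)  # results come back in input order
```

pool.map handles the distribution of work across threads itself, so no explicit queue is needed at all in this style.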
mylines.txt contains:
https://yahoo.com
https://www.google.com
https://facebook.com
https://twitter.com
Note that I first used:

import Queue

and:

q = Queue.Queue(4)

but I got this error:
Traceback (most recent call last):
File "test3.py", line 4, in <module>
import Queue
ModuleNotFoundError: No module named 'Queue'
and the relevant line was:

q = Queue.Queue(4)
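(For context: the Queue module was renamed to lowercase queue in Python 3, which is exactly what the ModuleNotFoundError above is complaining about. A minimal check:)

```python
import queue  # Python 3 name; the module was called "Queue" in Python 2

q = queue.Queue(4)  # bounded FIFO queue with maxsize 4
q.put("https://yahoo.com")
item = q.get()  # retrieves items in insertion order
```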
I also added:

from multiprocessing import Queue

but nothing worked. Can any Python multithreading expert help?

You should change your function to return a string:
def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

and then write those strings to the file:

results = pool.map(myFunc, q)

with open("myresults", "w") as f:
    for line in results:
        f.write(line + '\n')
This keeps the multithreading for the requests.get calls, but writes the results to the output file serially.
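The same fetch-in-parallel, write-serially pattern can also be expressed with the standard library's concurrent.futures; fetch here is a placeholder for the real requests.get call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder for the real requests.get(url) call
    return "url is:" + url

urls = ["https://yahoo.com", "https://google.com", "https://twitter.com"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map preserves input order, like pool.map
    results = list(executor.map(fetch, urls))
```

The main thread can then loop over results and write them to the output file with no locking needed, since only one thread writes.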
Update:

You should also use with to read the input file:
# load up a queue with your data, this will handle locking
q = queue.Queue()
with open("mylines.txt", "r") as f:  # open my input file
    for url in f:
        q.put(url)
Rather than have the pool of worker threads print their results out (which gives no guarantee the output is buffered correctly), create one more thread that reads results from a second queue and prints them.

I've modified your solution so it builds its own pool of worker threads. There's little point giving the queue a finite length, since the main thread will block when the queue reaches its maximum size: you only need it to be long enough to ensure there is always work for the workers to process - the main thread will block and unblock as the queue's size grows and shrinks.

It also identifies the thread responsible for each item on the output queue, which should give you some confidence that the multithreading approach is working, and it prints the response code from the server. I found I had to strip the newlines from the URLs.

Since only one thread now writes to the file, the writes are always perfectly in sequence and there is no chance of them interfering with each other.
import threading
import requests
import queue

POOL_SIZE = 4

def myFunc(inq, outq):  # worker thread deals only with queues
    while True:
        url = inq.get()  # blocks until something is available
        if url is None:
            break
        response = requests.get(url.strip(), timeout=(2, 5))
        outq.put((url, response, threading.current_thread().name))

class Writer(threading.Thread):
    def __init__(self, q):
        super().__init__()
        self.results = open("myresults", "a")  # "a" to append results
        self.queue = q
    def run(self):
        while True:
            url, response, threadname = self.queue.get()
            if response is None:
                self.results.close()
                break
            print("****url is:", url, ", response is:", response.status_code,
                  response.url, "thread", threadname, file=self.results)

# load up a queue with your data, this will handle locking
inq = queue.Queue()  # could usefully limit queue size here
outq = queue.Queue()

# start the Writer
writer = Writer(outq)
writer.start()

# make the pool of workers
threads = []
for i in range(POOL_SIZE):
    thread = threading.Thread(target=myFunc, name=f"worker{i}", args=(inq, outq))
    thread.start()
    threads.append(thread)

# push the work onto the queue
with open("mylines.txt", "r") as worker_data:  # open my input file
    for url in worker_data:
        inq.put(url.strip())

# one None sentinel per worker tells each worker to stop
for thread in threads:
    inq.put(None)

# close the pool and wait for the workers to finish
for thread in threads:
    thread.join()

# terminate the writer
outq.put((None, None, None))
writer.join()
Using the data given in mylines.txt, I see the following output:
****url is: https://www.google.com , response is: 200 https://www.google.com/ thread worker1
****url is: https://twitter.com , response is: 200 https://twitter.com/ thread worker2
****url is: https://facebook.com , response is: 200 https://www.facebook.com/ thread worker0
****url is: https://www.censys.io , response is: 200 https://censys.io/ thread worker1
****url is: https://yahoo.com , response is: 200 https://uk.yahoo.com/?p=us thread worker3
Comments:

"@philshem that's another question. Thanks for your help. Will it work with a file of millions of lines?"

"That's a separate question. Perhaps assemble all this code, make sure it works on small input files, and then try it on larger ones. If you hit problems you can post the symptoms as a new question."

"I made the suggested modifications. I get: ModuleNotFoundError: No module named 'Queue', and the reason isn't clear."

"My original code didn't work. I updated the code per your suggestion and tried several fixes for the queue problem. I ended up with no errors, no hang, and no output. I'm using Python 3.6, and some posts say the module name is lowercase (queue) in 3.x. Have you run it on your side? It hangs forever for me; the cursor just blinks. After pressing CTRL+C to exit I got: ^CException ignored in: Traceback (most recent call last): File "/usr/lib/python3.6/threading.py", line 1294, in _shutdown t.join() File "/usr/lib/python3.6/threading.py", line 1056, in join self._wait_for_tstate_lock() File "/usr/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock elif lock.acquire(block, timeout): KeyboardInterrupt"

"I'm testing with the five input lines from the question, running it with the python3 command on Ubuntu 18.04. I see the output file get created, but nothing is written to it and the program never ends. Now it seems to work, but my problem is that the output is duplicated. It runs, but not correctly: the output should not be repeated. It should perform the requests.get only once per URL read from the file. Why the duplication?"