Running parallel URL downloads with a worker pool in Haskell

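I would like to use mapConcurrently from Control.Concurrent.Async together with http-conduit to perform parallel downloads. On its own this is not enough for my case, because I want to process n tasks while limiting the number of concurrent workers to m, where m < n.

Passing successive chunks of m tasks to mapConcurrently is not sufficient either: the number of active workers will tend to drop below m, because some tasks finish earlier than others and leave a utilization gap.

Is there an easy way, nearly as easy as using mapConcurrently, to implement a worker pool that works through a queue of tasks concurrently until all tasks are done?

Or is it easier to keep the Haskell simple and do process-level parallelism with xargs -P?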

Perhaps the simplest solution is to throttle the IO actions with a helper function before wrapping them in Concurrently:

withConc :: QSem -> (a -> IO b) -> (a -> Concurrently b)
withConc sem f = \a -> Concurrently 
    (bracket_ (waitQSem sem) (signalQSem sem) (f a))

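We can then use withConc to perform a throttled concurrent traversal of any Traversable container of tasks (the version below is specialised to lists):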

traverseThrottled :: Int -> (a -> IO b) -> [a] -> IO [b]
traverseThrottled concLevel action tasks = do
    sem <- newQSem concLevel
    runConcurrently (traverse (withConc sem action) tasks)
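For readers who want to try the two definitions above, here is a minimal, self-contained sketch with the imports they need. It assumes the async package is installed. fakeDownload is a hypothetical stand-in for a real HTTP request, and the generalisation of the signature from lists to any Traversable is my own assumption rather than part of the answer.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (Concurrently (..))
import Control.Concurrent.QSem (QSem, newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- Wrap an action so that it must hold one semaphore slot while it runs.
-- bracket_ releases the slot even if the action throws an exception.
withConc :: QSem -> (a -> IO b) -> (a -> Concurrently b)
withConc sem f = \a ->
  Concurrently (bracket_ (waitQSem sem) (signalQSem sem) (f a))

-- Same idea as in the answer, generalised from lists to any Traversable
-- (an assumption made for this sketch).
traverseThrottled :: Traversable t => Int -> (a -> IO b) -> t a -> IO (t b)
traverseThrottled concLevel action tasks = do
  sem <- newQSem concLevel
  runConcurrently (traverse (withConc sem action) tasks)

-- Hypothetical stand-in for a real HTTP request.
fakeDownload :: Int -> IO String
fakeDownload i = do
  threadDelay 100000 -- pretend the request takes 0.1 s
  return ("response " ++ show i)

main :: IO ()
main = do
  -- Ten tasks, at most three of them past the semaphore at any moment.
  responses <- traverseThrottled 3 fakeDownload [1 .. 10]
  mapM_ putStrLn responses
```

Note that one thread is still created per task up front; the semaphore only limits how many of them are running the action at the same time.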

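I suggest using parallel or parallelInterleaved from parallel-io. It has (among others) these properties:

Never creates more or fewer unblocked threads than are specified to live in the pool. Note: this count includes the thread executing parallel. This should minimize contention and hence pre-emption, while also preventing starvation.

On return, all actions have been performed.

The function returns in a timely manner as soon as all actions have been performed.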
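As a rough illustration of that suggestion, the sketch below runs ten simulated downloads on a pool of three workers. It assumes the parallel-io package and its Control.Concurrent.ParallelIO.Local module (withPool and parallel). fakeDownload is a hypothetical placeholder, not part of the answer.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.ParallelIO.Local (parallel, withPool)

-- Hypothetical stand-in for a real HTTP request.
fakeDownload :: Int -> IO String
fakeDownload i = do
  threadDelay 100000 -- pretend the request takes 0.1 s
  return ("response " ++ show i)

main :: IO ()
main = do
  -- withPool creates a pool of three workers and tears it down afterwards.
  -- parallel returns the results in the same order as the input actions.
  responses <- withPool 3 $ \pool ->
    parallel pool (map fakeDownload [1 .. 10])
  mapM_ putStrLn responses
```

If the order of the results does not matter, parallelInterleaved can be used in place of parallel with the same arguments.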

You can use monad-par for this, which, like async, was written by Simon Marlow.

For example:

import Control.Concurrent (threadDelay)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Par.Combinator (parMapM)
import Control.Monad.Par.IO (runParIO)

download :: Int -> IO String
download i = do
  putStrLn ("downloading " ++ show i)
  threadDelay 1000000 -- sleep 1 second
  putStrLn ("downloading " ++ show i  ++ " finished")
  return "fake response"


main :: IO ()
main = do
  -- "pooled" mapM
  responses <- runParIO $ parMapM (\i -> liftIO $ download i) [1..10]
  mapM_ putStrLn responses

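The code is also available via [link].

The latest version of unliftio has various combinators for pooled concurrency.

Adapting Niklas's code from another answer, you can use unliftio like this: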

#!/usr/bin/env stack
-- stack --resolver lts-14.11 --install-ghc runghc --package unliftio --package say
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (threadDelay)
import Say
import UnliftIO.Async

download :: Int -> IO String
download i = do
  sayString ("downloading " ++ show i)
  threadDelay 1000000 -- sleep 1 second
  sayString ("downloading " ++ show i ++ " finished")
  return "fake response"

main :: IO ()
main = do
  responses <- pooledMapConcurrentlyN 5 download [1 .. 5]
  print responses
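The script above uses as many workers as tasks, so it does not show the limit taking effect. The following sketch, under the assumption that the unliftio package is available, runs ten tasks on three workers and records the highest number of downloads that were in flight at once. trackedDownload and runTracked are hypothetical helpers introduced only for this illustration.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (bracket_)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import UnliftIO.Async (pooledMapConcurrentlyN)

-- Run a fake download while tracking how many are in flight.
-- 'running' holds the current count, 'peak' the highest count seen so far.
trackedDownload :: IORef Int -> IORef Int -> Int -> IO String
trackedDownload running peak i = bracket_ enter leave work
  where
    enter = do
      n <- atomicModifyIORef' running (\r -> (r + 1, r + 1))
      atomicModifyIORef' peak (\p -> (max p n, ()))
    leave = atomicModifyIORef' running (\r -> (r - 1, ()))
    work = do
      threadDelay 50000 -- hypothetical stand-in for network latency
      return ("response " ++ show i)

-- Returns the responses and the peak number of simultaneous downloads.
runTracked :: Int -> [Int] -> IO ([String], Int)
runTracked m tasks = do
  running <- newIORef 0
  peak <- newIORef 0
  responses <- pooledMapConcurrentlyN m (trackedDownload running peak) tasks
  observed <- readIORef peak
  return (responses, observed)

main :: IO ()
main = do
  (responses, observed) <- runTracked 3 [1 .. 10]
  putStrLn ("peak concurrency: " ++ show observed)
  mapM_ putStrLn responses
```

The reported peak should never exceed the first argument of runTracked, and the responses come back in the order of the input list.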

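Thanks, this answer looks great. The threads are certainly cheaper than launching multiple Haskell processes with xargs -P, and it works very well. A related newbie question: importing only the functions and types I use from Async does not work, i.e. import Control.Concurrent.Async (Concurrently, runConcurrently) produces a strange error complaining that Concurrently cannot be found. Any clue about this? By the way, Concurrently looks like a function, since it is composed with throttleAction using a dot, yet it starts with a capital letter?

@KevinZhu In Haskell, some types have a value constructor with the same name as the type. Constructors start with a capital letter and can be used as functions. The type Concurrently has the value constructor Concurrently :: IO a -> Concurrently a. That is what gets composed with throttledAction. To import the constructor, and not only the type, you need to write import Control.Concurrent.Async (Concurrently (Concurrently), runConcurrently), or the dotted shorthand import Control.Concurrent.Async (Concurrently (..)), which imports all constructors and accessors defined in the type.

@KevinZhu If you want a pre-packaged solution, pooledMapConcurrentlyN from the unliftio package is similar to traverseThrottled, but it uses a thread pool to limit concurrency, rather than spawning all the threads at once and then using a semaphore.

My use case is not really parallelism with N worker threads on N cores, but concurrency, where HTTP requests may take a random amount of time to complete and the processor is idle while waiting on the network. Is parallel-io suitable for the latter case as well?

@dan Yes, the Pool defines the number of workers to use, and they then process a given list of n tasks so that m workers are executing at any time.

If I use parallel-io, do I always have to compile with the extra flag -threaded and supply +RTS -N2 -RTS? That seems a bit annoying.

@dan If passing the RTS options at run time is inconvenient, you can set them at compile time with GHC's -with-rtsopts option.

Somewhat related SO question: can you control the amount of parallelism by something other than RTS parameters?

@phadej For this case, Petr's answer using parallel-io seems like a good fit.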
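To make the point about constructors and import lists concrete, here is a small sketch, assuming the async package. shout and shoutC are hypothetical names used only for illustration.

```haskell
-- Import the type together with its constructor and field accessor.
-- An import list of (Concurrently, runConcurrently) would bring the type
-- and the accessor into scope, but not the constructor.
import Control.Concurrent.Async (Concurrently (..))

-- Hypothetical example action.
shout :: String -> IO String
shout s = return (s ++ "!")

-- The constructor is an ordinary function, so it composes with (.).
shoutC :: String -> Concurrently String
shoutC = Concurrently . shout

main :: IO ()
main = do
  -- The Applicative instance runs both actions concurrently.
  (a, b) <- runConcurrently ((,) <$> shoutC "hello" <*> shoutC "world")
  putStrLn (a ++ " " ++ b) -- prints "hello! world!"
```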

Output of the monad-par example:

downloading 10
downloading 9
downloading 9 finished
downloading 10 finished
downloading 8
downloading 7
downloading 8 finished
downloading 7 finished
downloading 6
downloading 5
downloading 6 finished
downloading 5 finished
downloading 4
downloading 3
downloading 4 finished
downloading 3 finished
downloading 2
downloading 1
downloading 2 finished
downloading 1 finished
fake response
fake response
fake response
fake response
fake response
fake response
fake response
fake response
fake response
fake response

$ stack pooled.hs
downloading 1
downloading 2
downloading 3
downloading 4
downloading 5
downloading 1 finished
downloading 5 finished
downloading 2 finished
downloading 3 finished
downloading 4 finished
["fake response","fake response","fake response","fake response","fake response"]