Why does my modified (Real World Haskell) MapReduce implementation fail with "Too many open files"?


I'm implementing a Haskell program that compares each line of a file with every other line in the file. For simplicity, let's assume the data structure represented by one line is just an Int, and my algorithm is the squared distance. I would implement it as follows:

--My operation
distance :: Int -> Int -> Int
distance a b = (a-b)*(a-b)

combineDistances :: [Int] -> Int
combineDistances = sum

--Applying my operation simply on a file
sumOfDistancesOnSmallFile :: FilePath -> IO Int
sumOfDistancesOnSmallFile path = do
              fileContents <- readFile path
              return $ allDistances $ map read $ lines $ fileContents
              where
                  allDistances (x:xs) = (allDistances xs) + ( sum $ map (distance x) xs)
                  allDistances _ = 0

--Test file generation
createTestFile :: Int -> FilePath -> IO ()
createTestFile n path = writeFile path $ unlines $ map show $ take n $ infiniteList 0 1
    where infiniteList :: Int->Int-> [Int]
          infiniteList i j = (i + j) : infiniteList j (i+j)
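
For reference, a minimal driver for the small-file version might look like this (the file name and size are arbitrary examples, not from the original post):

--Example only: build a 2000-line test file and sum all pairwise squared distances.
main :: IO ()
main = do
    createTestFile 2000 "distances_test.txt"
    total <- sumOfDistancesOnSmallFile "distances_test.txt"
    print total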
The complete implementation is (plus the previous code region):

import qualified Data.ByteString.Lazy.Char8 as Lazy
import Control.Exception (bracket, finally)
import Control.Monad (forM, liftM)
import Control.Parallel.Strategies
import Control.Parallel
import Control.DeepSeq (NFData)
import Data.Int (Int64)
import System.IO

--Applying my operation using mapreduce on a very big file
sumOfDistancesOnFile :: FilePath -> IO Int
sumOfDistancesOnFile path = chunkedFileOperation chunkByLinesTails (distancesUsingMapReduce) path

distancesUsingMapReduce :: [Lazy.ByteString] -> Int
distancesUsingMapReduce = mapReduce rpar (distancesFirstToTail . lexer)
                                    rpar combineDistances
              where lexer :: Lazy.ByteString -> [Int]
                    lexer chunk = map (read . Lazy.unpack) (Lazy.lines chunk)

distancesOneToMany :: Int -> [Int] -> Int
distancesOneToMany one many = combineDistances $ map (distance one) many

distancesFirstToTail :: [Int] -> Int
distancesFirstToTail s =
              if not (null s)
              then distancesOneToMany (head s) (tail s)
              else 0

--The mapreduce algorithm
mapReduce :: Strategy b -- evaluation strategy for mapping
          -> (a -> b)   -- map function
          -> Strategy c -- evaluation strategy for reduction
          -> ([b] -> c) -- reduce function
          -> [a]        -- list to map over
          -> c
mapReduce mapStrat mapFunc reduceStrat reduceFunc input =
          mapResult `pseq` reduceResult
          where mapResult    = parMap mapStrat mapFunc input
                reduceResult = reduceFunc mapResult `using` reduceStrat

--Working with (file)chunks:
data ChunkSpec = CS {
      chunkOffset :: !Int64
    , chunkLength :: !Int64
    } deriving (Eq, Show)

chunkedFileOperation :: (NFData a) =>
            (FilePath -> IO [ChunkSpec])
       ->   ([Lazy.ByteString] -> a)
       ->   FilePath
       ->   IO a
chunkedFileOperation chunkCreator funcOnChunks path = do
    (chunks, handles) <- chunkedRead chunkCreator path
    let r = funcOnChunks chunks
    -- all handles are closed only after the whole result has been computed
    (rdeepseq r `seq` return r) `finally` mapM_ hClose handles

chunkedRead :: (FilePath -> IO [ChunkSpec])
        ->  FilePath
        ->  IO ([Lazy.ByteString], [Handle])
chunkedRead chunkCreator path = do
    chunks <- chunkCreator path
    -- one handle is opened per chunk; the handles are returned so the caller can close them
    liftM unzip . forM chunks $ \spec -> do
        h <- openFile path ReadMode
        hSeek h AbsoluteSeek (fromIntegral (chunkOffset spec))
        chunk <- Lazy.take (chunkLength spec) `liftM` Lazy.hGetContents h
        return (chunk, h)

-- returns set of chunks representing  tails . lines . readFile
chunkByLinesTails :: FilePath -> IO [ChunkSpec]
chunkByLinesTails path = do
    bracket (openFile path ReadMode) hClose $ \h -> do
        totalSize <- fromIntegral `liftM` hFileSize h
        let chunkSize = 1
            findChunks offset = do
                let newOffset = offset + chunkSize
                hSeek h AbsoluteSeek (fromIntegral newOffset)
                let findNewline lineSeekOffset = do
                        eof <- hIsEOF h
                        if eof
                          then return [CS offset (totalSize - offset)]
                          else do
                            bytes <- Lazy.hGet h 4096
                            case Lazy.elemIndex '\n' bytes of
                                Just n -> do
                                    nextChunks <- findChunks (lineSeekOffset + n + 1)
                                    return (CS offset (totalSize - offset) : nextChunks)
                                Nothing -> findNewline (lineSeekOffset + Lazy.length bytes)
                findNewline newOffset
        findChunks 0

The error means exactly what it says: the process has too many files open. The operating system imposes an (arbitrary) limit on the number of files (or directories) a process may have open at the same time. See your ulimit(1) man page, and/or limit the number of mappers.
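
As a side note, the soft limit the answer points at can also be queried from Haskell (a sketch using System.Posix.Resource from the unix package, not part of the original post; `ulimit -n` in a shell reports the same value):

import System.Posix.Resource (getResourceLimit, Resource(ResourceOpenFiles),
                              ResourceLimits(softLimit), ResourceLimit(..))

-- Print the per-process soft limit on simultaneously open file descriptors.
printOpenFileLimit :: IO ()
printOpenFileLimit = do
    limits <- getResourceLimit ResourceOpenFiles
    putStrLn $ case softLimit limits of
        ResourceLimit n       -> "open-file soft limit: " ++ show n
        ResourceLimitInfinity -> "open-file soft limit: unlimited"
        ResourceLimitUnknown  -> "open-file soft limit: unknown"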

You're using lazy IO, so files opened with readFile aren't closed promptly. You need to think about a solution that explicitly closes files at regular intervals (e.g. via strict IO, or iteratee IO).
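
To illustrate the "close files explicitly" direction, here is a minimal sketch (the name chunkedReadStrict is hypothetical; it reuses ChunkSpec from the question): each chunk is read strictly and its handle is closed before the next one is opened, so only one descriptor is live at a time.

import qualified Data.ByteString as Strict
import qualified Data.ByteString.Lazy.Char8 as Lazy
import Control.Exception (bracket)
import Control.Monad (forM)
import System.IO

-- Hypothetical strict variant of chunkedRead: each chunk is read in full and
-- its handle closed before the next chunk is opened, so at most one file
-- descriptor is open at any time (at the cost of holding the chunk in memory).
chunkedReadStrict :: (FilePath -> IO [ChunkSpec]) -> FilePath -> IO [Lazy.ByteString]
chunkedReadStrict chunkCreator path = do
    chunks <- chunkCreator path
    forM chunks $ \spec ->
        bracket (openFile path ReadMode) hClose $ \h -> do
            hSeek h AbsoluteSeek (fromIntegral (chunkOffset spec))
            -- the strict hGet forces the read before hClose runs
            bytes <- Strict.hGet h (fromIntegral (chunkLength spec))
            return (Lazy.fromChunks [bytes])

Note that with the tails-style chunks used here every chunk extends to the end of the file, so reading them strictly trades descriptors for memory; the sketch only demonstrates closing handles promptly, not an efficient algorithm.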

You shouldn't post signatures on Stack Overflow; your account name is already shown in the box below the question.

With the import import qualified Data.ByteString.Char8 as Lazy and the chunk type data ChunkSpec = CS { chunkOffset :: !Int, chunkLength :: !Int } deriving (Eq, Show), the program does run on large files.

My first attempt with iteratee IO unfortunately failed, and I asked for help with that, so I applied both memory-reducing solutions. But is it the lazy IO that makes mapReduce's pseq spark too many threads, or the postponed finally/hClose in chunkedFileOperation? That part of the book's example wasn't clear to me, because I read it as releasing all the handles at the end, rather than releasing each handle as soon as it has been read.

No, there aren't too many threads. There are too many open files (they are closed lazily, only once the GC decides they are no longer needed).

I know the number of file handles is limited, but I expected my algorithm to use at most as many as the number of threads GHC creates, which should be very low (at least, that was the intent of the implementation).
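
For completeness, the descriptor count in this design tracks the number of chunks rather than the number of threads: chunkedRead opens a fresh Handle per ChunkSpec, and chunkedFileOperation only closes them in its final finally, so chunkByLinesTails over an n-line file keeps roughly n handles open at once. A rough way to confirm this, assuming the definitions above (the helper name is made up):

-- Hypothetical check: how many descriptors does chunkedRead hold open at once
-- for a given chunking strategy? With chunkByLinesTails this is roughly the
-- number of lines in the file.
countChunkHandles :: FilePath -> IO Int
countChunkHandles path = do
    (_, handles) <- chunkedRead chunkByLinesTails path
    mapM_ hClose handles      -- close them again; we only wanted the count
    return (length handles)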