Why does my modified (Real World Haskell) MapReduce implementation fail with "Too many open files"?


I'm implementing a Haskell program that compares each line of a file with every other line in the file. For simplicity, let's assume the data structure represented by one line is just an Int, and my algorithm is the squared distance. I would implement it as follows:

--My operation
distance :: Int -> Int -> Int
distance a b = (a-b)*(a-b)

combineDistances :: [Int] -> Int
combineDistances = sum

--Applying my operation simply on a file
sumOfDistancesOnSmallFile :: FilePath -> IO Int
sumOfDistancesOnSmallFile path = do
              fileContents <- readFile path
              return $ allDistances $ map read $ lines $ fileContents
              where
                  allDistances (x:xs) = (allDistances xs) + ( sum $ map (distance x) xs)
                  allDistances _ = 0

--Test file generation
createTestFile :: Int -> FilePath -> IO ()
createTestFile n path = writeFile path $ unlines $ map show $ take n $ infiniteList 0 1
    where infiniteList :: Int->Int-> [Int]
          infiniteList i j = (i + j) : infiniteList j (i+j)
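
For reference, a minimal driver for the small-file version might look like this (the file name and size are arbitrary examples, not from the original post):

--Example only: build a 2000-line test file and sum all pairwise squared distances.
main :: IO ()
main = do
    createTestFile 2000 "distances_test.txt"
    total <- sumOfDistancesOnSmallFile "distances_test.txt"
    print total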
The complete implementation is (plus the previous code region):

import qualified Data.ByteString.Lazy.Char8 as Lazy
import Control.Exception (bracket, finally)
import Control.Monad (forM, liftM)
import Control.Parallel.Strategies
import Control.Parallel
import Control.DeepSeq (NFData)
import Data.Int (Int64)
import System.IO

--Applying my operation using mapreduce on a very big file
sumOfDistancesOnFile :: FilePath -> IO Int
sumOfDistancesOnFile path = chunkedFileOperation chunkByLinesTails (distancesUsingMapReduce) path

distancesUsingMapReduce :: [Lazy.ByteString] -> Int
distancesUsingMapReduce = mapReduce rpar (distancesFirstToTail . lexer)
                                    rpar combineDistances
              where lexer :: Lazy.ByteString -> [Int]
                    lexer chunk = map (read . Lazy.unpack) (Lazy.lines chunk)

distancesOneToMany :: Int -> [Int] -> Int
distancesOneToMany one many = combineDistances $ map (distance one) many

distancesFirstToTail :: [Int] -> Int
distancesFirstToTail s =
              if not (null s)
              then distancesOneToMany (head s) (tail s)
              else 0

--The mapreduce algorithm
mapReduce :: Strategy b -- evaluation strategy for mapping
          -> (a -> b)   -- map function
          -> Strategy c -- evaluation strategy for reduction
          -> ([b] -> c) -- reduce function
          -> [a]        -- list to map over
          -> c
mapReduce mapStrat mapFunc reduceStrat reduceFunc input =
          mapResult `pseq` reduceResult
          where mapResult    = parMap mapStrat mapFunc input
                reduceResult = reduceFunc mapResult `using` reduceStrat

--Working with (file)chunks:
data ChunkSpec = CS {
      chunkOffset :: !Int64
    , chunkLength :: !Int64
    } deriving (Eq, Show)

chunkedFileOperation :: (NFData a) =>
            (FilePath -> IO [ChunkSpec])
       ->   ([Lazy.ByteString] -> a)
       ->   FilePath
       ->   IO a
chunkedFileOperation chunkCreator funcOnChunks path = do
    (chunks, handles) <- chunkedRead chunkCreator path
    let r = funcOnChunks chunks
    -- all handles are closed only after the whole result has been computed
    (rdeepseq r `seq` return r) `finally` mapM_ hClose handles

chunkedRead :: (FilePath -> IO [ChunkSpec])
        ->  FilePath
        ->  IO ([Lazy.ByteString], [Handle])
chunkedRead chunkCreator path = do
    chunks <- chunkCreator path
    -- one handle is opened per chunk; the handles are returned so the caller can close them
    liftM unzip . forM chunks $ \spec -> do
        h <- openFile path ReadMode
        hSeek h AbsoluteSeek (fromIntegral (chunkOffset spec))
        chunk <- Lazy.take (chunkLength spec) `liftM` Lazy.hGetContents h
        return (chunk, h)

-- returns set of chunks representing  tails . lines . readFile
chunkByLinesTails :: FilePath -> IO [ChunkSpec]
chunkByLinesTails path = do
    bracket (openFile path ReadMode) hClose $ \h -> do
        totalSize <- fromIntegral `liftM` hFileSize h
        let chunkSize = 1
            findChunks offset = do
                let newOffset = offset + chunkSize
                hSeek h AbsoluteSeek (fromIntegral newOffset)
                let findNewline lineSeekOffset = do
                        eof <- hIsEOF h
                        if eof
                          then return [CS offset (totalSize - offset)]
                          else do
                            bytes <- Lazy.hGet h 4096
                            case Lazy.elemIndex '\n' bytes of
                                Just n -> do
                                    nextChunks <- findChunks (lineSeekOffset + n + 1)
                                    return (CS offset (totalSize - offset) : nextChunks)
                                Nothing -> findNewline (lineSeekOffset + Lazy.length bytes)
                findNewline newOffset
        findChunks 0

The error means exactly what it says: the process has too many files open. The operating system imposes an (arbitrary) limit on the number of files (or directories) a process may have open at the same time. See your ulimit(1) man page, and/or limit the number of mappers.
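
As a side note, the soft limit the answer points at can also be queried from Haskell (a sketch using System.Posix.Resource from the unix package, not part of the original post; `ulimit -n` in a shell reports the same value):

import System.Posix.Resource (getResourceLimit, Resource(ResourceOpenFiles),
                              ResourceLimits(softLimit), ResourceLimit(..))

-- Print the per-process soft limit on simultaneously open file descriptors.
printOpenFileLimit :: IO ()
printOpenFileLimit = do
    limits <- getResourceLimit ResourceOpenFiles
    putStrLn $ case softLimit limits of
        ResourceLimit n       -> "open-file soft limit: " ++ show n
        ResourceLimitInfinity -> "open-file soft limit: unlimited"
        ResourceLimitUnknown  -> "open-file soft limit: unknown"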

You're using lazy IO, so files opened with readFile aren't closed promptly. You need to think about a solution that explicitly closes files at regular intervals (e.g. via strict IO, or iteratee IO).
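
To illustrate the "close files explicitly" direction, here is a minimal sketch (the name chunkedReadStrict is hypothetical; it reuses ChunkSpec from the question): each chunk is read strictly and its handle is closed before the next one is opened, so only one descriptor is live at a time.

import qualified Data.ByteString as Strict
import qualified Data.ByteString.Lazy.Char8 as Lazy
import Control.Exception (bracket)
import Control.Monad (forM)
import System.IO

-- Hypothetical strict variant of chunkedRead: each chunk is read in full and
-- its handle closed before the next chunk is opened, so at most one file
-- descriptor is open at any time (at the cost of holding the chunk in memory).
chunkedReadStrict :: (FilePath -> IO [ChunkSpec]) -> FilePath -> IO [Lazy.ByteString]
chunkedReadStrict chunkCreator path = do
    chunks <- chunkCreator path
    forM chunks $ \spec ->
        bracket (openFile path ReadMode) hClose $ \h -> do
            hSeek h AbsoluteSeek (fromIntegral (chunkOffset spec))
            -- the strict hGet forces the read before hClose runs
            bytes <- Strict.hGet h (fromIntegral (chunkLength spec))
            return (Lazy.fromChunks [bytes])

Note that with the tails-style chunks used here every chunk extends to the end of the file, so reading them strictly trades descriptors for memory; the sketch only demonstrates closing handles promptly, not an efficient algorithm.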

You shouldn't post signatures on Stack Overflow; your account name is already shown in the box below the question.

With the import import qualified Data.ByteString.Char8 as Lazy and the chunk type data ChunkSpec = CS { chunkOffset :: !Int, chunkLength :: !Int } deriving (Eq, Show), the program does run on large files.

My first attempt with iteratee IO unfortunately failed, and I asked for help with that, so I applied both memory-reducing solutions. But is it the lazy IO that makes mapReduce's pseq spark too many threads, or the postponed finally/hClose in chunkedFileOperation? That part of the book's example wasn't clear to me, because I read it as releasing all the handles at the end, rather than releasing each handle as soon as it has been read.

No, there aren't too many threads. There are too many open files (they are closed lazily, only once the GC decides they are no longer needed).

I know the number of file handles is limited, but I expected my algorithm to use at most as many as the number of threads GHC creates, which should be very low (at least, that was the intent of the implementation).
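
For completeness, the descriptor count in this design tracks the number of chunks rather than the number of threads: chunkedRead opens a fresh Handle per ChunkSpec, and chunkedFileOperation only closes them in its final finally, so chunkByLinesTails over an n-line file keeps roughly n handles open at once. A rough way to confirm this, assuming the definitions above (the helper name is made up):

-- Hypothetical check: how many descriptors does chunkedRead hold open at once
-- for a given chunking strategy? With chunkByLinesTails this is roughly the
-- number of lines in the file.
countChunkHandles :: FilePath -> IO Int
countChunkHandles path = do
    (_, handles) <- chunkedRead chunkByLinesTails path
    mapM_ hClose handles      -- close them again; we only wanted the count
    return (length handles)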