Haskell: Scanning a list and applying a different function to each element


list haskell functional-programming


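I need to scan through a document and accumulate the output of different functions for each string in the file. Which function runs on a given line of the file depends on what is in that line.

I can do this very inefficiently by making a full pass over the file for every list I want to collect. Pseudocode example: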

at :: B.ByteString -> Maybe Atom
at line
    | line == ATOM record = do stuff to return Just Atom
    | otherwise = Nothing

ot :: B.ByteString -> Maybe Sheet
ot line
    | line == SHEET record = do other stuff to return Just Sheet
    | otherwise = Nothing

I then map these functions over the entire list of lines in the file to get the complete lists of atoms and sheets:

mapper :: [B.ByteString] -> IO ()
mapper lines = do
    let atoms = mapMaybe at lines
    let sheets = mapMaybe ot lines
    -- Do stuff with my atoms and sheets

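However, this is inefficient because I map over the entire list of strings once for every list I am trying to create. Instead, I would like to map over the list of line strings only once, identify each line as I go, apply the appropriate function, and store the resulting values in different lists.

My C mindset wants to do this (pseudocode):

What is the Haskell way of doing this? I just can't get my functional-programming brain to come up with a solution.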

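If you have only two alternatives, using Either might be a good idea. In that case, combine your functions, map over the list, and use lefts and rights to get the results: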
import Data.Either

-- first sample function, returning String
f1 x = show $ x `div` 2

-- second sample function, returning Int
f2 x = 3*x+1

-- combined function returning Either String Int
hotpo x = if even x then Left (f1 x) else Right (f2 x)

xs = map hotpo [1..10] 
-- [Right 4,Left "1",Right 10,Left "2",Right 16,Left "3",Right 22,Left "4",Right 28,Left "5"]

lefts xs 
-- ["1","2","3","4","5"]

rights xs
-- [4,10,16,22,28]

I show a solution for two kinds of line, but it is easily extended to five kinds of line by using a five-tuple instead of a two-tuple.
import Data.Monoid

eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs | isAnAtom bs = ([ {- calculate an Atom -} ], [])
            | isASheet bs = ([], [ {- calculate a Sheet -} ])
            | otherwise = error "eachLine"

allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines bss = mconcat (map eachLine bss)

The magic is done by mconcat from Data.Monoid (which ships with GHC).
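To see why mconcat is enough here, it may help to look at a toy case. Pairs of lists form a Monoid whose append works component by component, so concatenating many small pairs yields one pair of full lists. The sketch below uses Int input and a made-up perLine function purely for illustration; neither is part of the answer above.

```haskell
-- Hypothetical per-element classifier: each input contributes to exactly
-- one of the two lists, mirroring the shape of eachLine.
perLine :: Int -> ([Int], [String])
perLine n
  | even n    = ([n], [])        -- goes into the first list
  | otherwise = ([], [show n])   -- goes into the second list

-- mconcat appends the first components together and the second
-- components together, because (a, b) is a Monoid when a and b are.
collected :: ([Int], [String])
collected = mconcat (map perLine [1 .. 6])
-- ([2,4,6],["1","3","5"])
```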
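(Style-wise, I would personally define a Line type and a parseLine :: B.ByteString -> Line function, then write eachLine bs = case parseLine bs of …. But that is beside the point of your question.)

It would be a good idea to introduce a new ADT, e.g. "Summary", instead of tuples. Then, since you want to accumulate Summary values, make it an instance of Monoid (from Data.Monoid). Then classify each line with the help of classifier functions (e.g. isAtom, isSheet, etc.) and concatenate the results using Monoid's mconcat function (as suggested by @dave4420).
Here is the code (it uses String instead of ByteString, but that is easy to change):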
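A minimal sketch of the Summary approach described above, not the answerer's original code. The Atom and Sheet newtypes are placeholders, and the classifiers assume that the record kind is simply the line's prefix ("ATOM" or "SHEET"); both are assumptions made for illustration. Lines matching neither classifier are assumed to contribute nothing.

```haskell
import Data.List (isPrefixOf)

-- Placeholder record types; the real Atom and Sheet are not shown in the question.
newtype Atom  = Atom String  deriving (Eq, Show)
newtype Sheet = Sheet String deriving (Eq, Show)

-- One field per kind of record being collected.
data Summary = Summary
  { atoms  :: [Atom]
  , sheets :: [Sheet]
  } deriving (Eq, Show)

-- Combining two summaries appends field by field.
instance Semigroup Summary where
  Summary a1 s1 <> Summary a2 s2 = Summary (a1 <> a2) (s1 <> s2)

instance Monoid Summary where
  mempty = Summary [] []

-- Hypothetical classifiers: assume the record kind is the line's prefix.
isAtom, isSheet :: String -> Bool
isAtom  = ("ATOM"  `isPrefixOf`)
isSheet = ("SHEET" `isPrefixOf`)

-- Turn one line into a Summary holding at most one record.
classify :: String -> Summary
classify l
  | isAtom l  = mempty { atoms  = [Atom l] }
  | isSheet l = mempty { sheets = [Sheet l] }
  | otherwise = mempty   -- assumption: unrecognised lines are skipped

-- Map once over the input, then concatenate the per-line summaries.
summarize :: [String] -> Summary
summarize = mconcat . map classify
```

Adding a third kind of record then means adding one field, one classifier, and one guard, without changing the type of summarize.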
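
First off, I think the answers others have supplied will work at least 95% of the time. It is always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you are looking for in the list, and in those cases trying to enumerate all the possibilities is difficult, time-consuming, or error-prone. Or you are writing multiple variants of the same sort of thing (manually inlining several folds into one) and you would like to capture the abstraction.
Fortunately, there are a few techniques that can help.
Framework solution
(a bit of self-promotion)
First, the various "iteratee/enumerator" packages often provide functions for dealing with this type of problem. I am most familiar with iteratee, which would let you do the following: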
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Maybe

-- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
-- if you want to just return them as a list, you can use the built-in
-- stream2list function

-- next, create stream transformers
-- given at :: B.ByteString -> Maybe Atom
-- create a stream transformer from ByteString lines to Atoms
atIter :: Enumeratee [B.ByteString] [Atom] m a
atIter = I.mapChunks (catMaybes . map at)

otIter :: Enumeratee [B.ByteString] [Sheet] m a
otIter = I.mapChunks (catMaybes . map ot)

-- finally, combine multiple processors into one
-- if you have more than one processor, you can use zip3, zip4, etc.
procFile :: Iteratee [B.ByteString] m ([Atom],[Sheet])
procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)

-- and run it on some data
runner :: FilePath -> IO ([Atom],[Sheet])
runner filename = do
  resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
  run resultIter

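The result of this is not quite the same as a single for loop: it still performs multiple traversals of the data. However, the traversal pattern has changed. This loads a certain amount of data at a time (defaultBufSize bytes) and traverses that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.
Hopefully this demonstrates the difference: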
Data.List.zip:
  x1 x2 x3 .. x_n
                   x1 x2 x3 .. x_n

Data.Iteratee.zip:
  x1 x2      x3 x4      x_n-1 x_n
       x1 x2      x3 x4           x_n-1 x_n

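If you are doing enough work that parallelism makes sense, this is not a problem at all. Because of memory locality, performance is much better than making multiple traversals over the entire input, as Data.List.zip would.
Beautiful solution
If a single-traversal solution really does make the most sense, you might be interested in the posts by Max Rabkin and Conal Elliott. The basic idea is that you can create data structures to represent folds and zips, and by combining these you can create a new, combined fold/zip function that needs only a single traversal. It may be a bit advanced for a Haskell beginner, but since you are thinking about this problem you may find it interesting or useful. Max's post is probably the best starting point.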
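The idea of representing folds as data can be sketched roughly as follows. This is an illustration of the general technique, not code taken from either post; the names Fold, Pair, runFold, and collect, and the two toy extractors, are chosen here for the example.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl')

-- A left fold as a value: step function, initial state, final projection.
-- The state type x is hidden, so folds with different states can be combined.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- Strict pair, so both accumulators are forced at every step.
data Pair a b = Pair !a !b

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  Fold stepL beginL doneL <*> Fold stepR beginR doneR =
    Fold step (Pair beginL beginR) done
    where
      -- Feed each input element to both folds in the same pass.
      step (Pair l r) a = Pair (stepL l a) (stepR r a)
      done (Pair l r)   = doneL l (doneR r)

-- Run a (possibly combined) fold with one traversal of the list.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl' step begin

-- Collect the Just results of an extractor, preserving input order.
collect :: (a -> Maybe b) -> Fold a [b]
collect f = Fold (\acc a -> maybe acc (: acc) (f a)) [] reverse

-- Two toy extractors standing in for functions like at and ot.
halveEven :: Int -> Maybe Int
halveEven n = if even n then Just (n `div` 2) else Nothing

showOdd :: Int -> Maybe String
showOdd n = if odd n then Just (show n) else Nothing

-- Both collectors combined; the input list is walked only once.
both :: Fold Int ([Int], [String])
both = (,) <$> collect halveEven <*> collect showOdd
-- runFold both [1 .. 6] gives ([1,2,3],["1","3","5"])
```

More collectors can be added with further <*> applications, for example (,,) <$> f <*> g <*> h, and the result is still a single traversal.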
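
This is almost correct: just pass atoms and sheets along as accumulator variables and return them as a tuple at the end.

I'm not sure I follow. My pseudocode above makes no sense in Haskell. Not only does Haskell have no for loops, there are no guards in a do construct either. Also, I'm not interested in returning a tuple. What is shown above is an example. What if I want to return 100 different types of list?

That's why I said almost. If you don't know the number and kinds of lists you need to produce, you can't do it in one function. It's as simple as that. Remember, functions need to have a type. If, OTOH, you know that an extraction function always yields an [a] list, and all extraction functions yield [a] lists, then you can return a list of lists or a map of lists.

Couldn't you have data ParseResult = PAtom Atom | PSheet Sheet, map a B.ByteString -> ParseResult, then define (patoms, rest) = partition isAtom parseRes and atoms = fromPAtom patoms and sheets = fromPSheet rest?

@Ptival: Your answer is really interesting. Can you flesh it out? I really like the idea of using conditional data types, but I'm not sure how you are using them.

No. The number of list types I need to return won't be limited to 2. That was just an example. At the moment I actually want to collect 5 list types.

An n-tuple is very dubious. I would very much like to avoid something like that, but I have no other option right now.

For a small, local piece of code, an n-tuple is fine (imo). If it were to spread more widely through the program, I would define ...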
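
The ParseResult idea raised in the comments could be fleshed out along these lines. This is only a sketch under assumptions: it works on String rather than B.ByteString, uses placeholder Atom and Sheet types, adds a POther constructor for lines that match nothing, and splits the results with one foldr rather than the partition mentioned in the comment.

```haskell
import Data.List (isPrefixOf)

-- Placeholder record types; the real Atom and Sheet are not shown in the thread.
newtype Atom  = Atom String  deriving (Eq, Show)
newtype Sheet = Sheet String deriving (Eq, Show)

-- One constructor per kind of line. POther is an addition made here so that
-- parseLine is total.
data ParseResult = PAtom Atom | PSheet Sheet | POther
  deriving (Eq, Show)

-- Hypothetical parser: assume the record kind is the line's prefix.
parseLine :: String -> ParseResult
parseLine l
  | "ATOM"  `isPrefixOf` l = PAtom (Atom l)
  | "SHEET" `isPrefixOf` l = PSheet (Sheet l)
  | otherwise              = POther

-- Route every parsed line to its list in a single right fold.
splitResults :: [ParseResult] -> ([Atom], [Sheet])
splitResults = foldr step ([], [])
  where
    step (PAtom a)  (atomsAcc, sheetsAcc) = (a : atomsAcc, sheetsAcc)
    step (PSheet s) (atomsAcc, sheetsAcc) = (atomsAcc, s : sheetsAcc)
    step POther     acc                   = acc
```

With this shape, splitResults (map parseLine ls) identifies each line once, and the pattern match decides which list it joins.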

Parallel variant of procFile from the iteratee answer, using parI from Data.Iteratee.Parallel:
import Data.Iteratee.Parallel

parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)
