Haskell/Conduit: reading a file line by line

Scenario: I have a text file of roughly 900 MB in the following format

...
Id:   109101
ASIN: 0806978473
  title: The Beginner's Guide to Tai Chi
  group: Book
  salesrank: 672264
  similar: 0
  categories: 3
   |Books[283155]|Subjects[1000]|Sports[26]|Individual Sports[16533]|Martial Arts[16571]|General[16575]
   |Books[283155]|Subjects[1000]|Sports[26]|Individual Sports[16533]|Martial Arts[16571]|Taichi[16583]
   |Books[283155]|Subjects[1000]|Sports[26]|General[11086921]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-4-4  cutomer: A191SV1V1MK490  rating: 5  votes:   0  helpful:   0
    2004-7-10  cutomer:  AVXBUEPNVLZVC  rating: 5  votes:   0  helpful:   0
                    (----- empty line ------)    
Id :

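and I want to parse information out of it.

Question: as a first step (because I also need it in another context), I want to process the file line by line, then collect the "chunks" that belong to one product, and then process each chunk separately with some other logic.

So the plan is as follows:

define a source that represents the text file

define a conduit (?) that takes one line at a time from that source and then

... passes it on to some other components

I am now trying to adapt the following example: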

    doStuff = do
      writeFile "input.txt" "This is a\ntest."  -- FilePath -> String -> IO ()
      runConduitRes                              -- m r
        $ sourceFileBS "input.txt"               -- ConduitT i ByteString m ()  -- by "chunk"
        .| sinkFile "output.txt"                 -- FilePath -> ConduitT ByteString o m ()
      readFile "output.txt"
        >>= putStrLn

So

    sourceFileBS "input.txt"

is of type

    ConduitT i ByteString m ()

that is, it has input type `i`, output type `ByteString`, monad type `m` and result type `()`.

sinkFile streams all incoming data into the given file, so

    sinkFile "output.txt"

is a conduit whose input type is ByteString.
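The types described above can be written down explicitly. This is a minimal sketch, assuming the `conduit` package; the names `source`, `sink`, `copyPipeline` and `runCopy` are made up for illustration. Fusing with `.|` needs the output type of the left side to match the input type of the right side, and `runConduitRes` expects a complete pipeline with input `()` and output `Void`.

```haskell
import Conduit
import Data.ByteString (ByteString)
import Data.Void (Void)

-- Produces chunks of bytes; it never awaits, so the input type stays free.
source :: MonadResource m => ConduitT i ByteString m ()
source = sourceFileBS "input.txt"

-- Consumes chunks of bytes; it never yields, so the output type stays free.
sink :: MonadResource m => ConduitT ByteString o m ()
sink = sinkFile "output.txt"

-- Fusing both gives a closed pipeline: input (), output Void.
copyPipeline :: MonadResource m => ConduitT () Void m ()
copyPipeline = source .| sink

-- runConduitRes runs a closed pipeline and releases the file handles.
runCopy :: IO ()
runCopy = runConduitRes copyPipeline
```

Calling `runCopy` should copy `input.txt` to `output.txt` chunk by chunk, provided `input.txt` exists.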

What I want now is to process the input source line by line, that is, to pass exactly one line downstream at a time. In pseudocode:

sourceFile "input.txt"
splitIntoLines
yieldMany (?)
other stuff

How do I do that?

What I have now is

    copyFile = do
      writeFile "input.txt" "This is a\ntest."  -- FilePath -> String -> IO ()
      runConduitRes                              -- m r
        (lineC $ sourceFileBS "input.txt")       -- ConduitT i ByteString m ()  -- by "chunk"
        .| sinkFile "output.txt"                 -- FilePath -> ConduitT ByteString o m ()
      readFile "output.txt"
        >>= putStrLn

But this produces the following type error:

    * Couldn't match type `bytestring-0.10.8.2:Data.ByteString.Internal.ByteString'
                     with `Void'
      Expected type: ConduitT
                       ()
                       Void
                       (ResourceT
                          (ConduitT
                             a0 bytestring-0.10.8.2:Data.ByteString.Internal.ByteString m0))
                       ()
        Actual type: ConduitT
                       ()
                       bytestring-0.10.8.2:Data.ByteString.Internal.ByteString
                       (ResourceT
                          (ConduitT
                             a0 bytestring-0.10.8.2:Data.ByteString.Internal.ByteString m0))
                       ()
    * In the first argument of `runConduitRes', namely
        `(lineC $ sourceFileBS "input.txt")'
      In the first argument of `(.|)', namely
        `runConduitRes (lineC $ sourceFileBS "input.txt")'
      In a stmt of a 'do' block:
        runConduitRes (lineC $ sourceFileBS "input.txt")
          .| sinkFile "output.txt"
   |
28 |     (lineC $ sourceFileBS "input.txt")   -- ConduitT i ByteString m ()  -- by "chunk"
   |      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

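This leads me to believe that the problem now is that the first conduit in the pipeline does not have an input type that is compatible with runConduitRes.

I just can't figure it out and could really use a hint.

Thanks a lot in advance.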
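One reading of the error above: GHC reports `(lineC $ sourceFileBS "input.txt")` as the whole argument of `runConduitRes`, so the `.| sinkFile ...` part ends up outside the run and the pipeline still has output type `ByteString` instead of `Void`. In addition, `lineC` wraps a consumer so that it sees a single line; it does not split a stream into lines. A possible fix, sketched under the assumption that `linesUnboundedAsciiC` from the `Conduit` module is an acceptable line splitter here (the name `readLines` is made up for illustration):

```haskell
import Conduit
import qualified Data.ByteString.Char8 as BC

-- | Read a file and return its lines, one ByteString per line.
readLines :: FilePath -> IO [BC.ByteString]
readLines path =
  runConduitRes $
    sourceFileBS path          -- yields arbitrary chunks of bytes
      .| linesUnboundedAsciiC  -- regroups the chunks into one value per line
      .| sinkList              -- collects everything; only sensible for small inputs
```

For the 900 MB file, `sinkList` would hold every line in memory, so it would likely be replaced by a streaming consumer such as `mapM_C`.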

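I was struggling with this today and found this question while trying to solve a similar problem. I was trying to split a git log into chunks for further parsing, such as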

commit 12345
Author: Me
Date:   Thu Jan 25 13:45:16 2019 -0500

    made some changes

 1 file changed, 10 insertions(+), 0 deletions(-)

commit 54321
Author: Me
...and so on...

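The function I needed was almost splitOnUnboundedE, but I couldn't quite figure out how to write the predicate function for it.

I came up with the following conduit, which is a slight modification of splitOnUnboundedE. It works on a stream in which each element is one line of text, since I find it easier to think about it that way, although this is certainly not the optimal solution.

It groups lines of text together using a function that takes the next line and returns a Bool indicating whether that line is the start of the next group of text.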


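    groupLines :: (Monad m, MonadIO m) => (Text -> Bool) -> [Text] -> ConduitM Text [Text] m ()
    groupLines startNextLine ls = start
      where
        -- If the next element of the stream is Nothing, return.
        -- If it is Just a line, start accumulating with that line.
        start = await >>= maybe (return ()) (accumulateLines ls)

        accumulateLines ls nextLine = do
          -- If ls is [], add nextLine and try to get another line. If there is none, yield.
          -- If there is one, call accumulateLines again.
          -- If ls is non-empty, check whether the new line starts the next group. If it does not,
          -- add nextLine to ls and continue. If there is no further line, yield.
          -- If the new line _is_ the start of the next group, yield this group of lines
          -- and call accumulateLines again with an empty group.
          nextLine' <- await
          case nextLine' of
            Nothing -> yield ls'
            Just l ->
              if Prelude.null ls
                then accumulateLines ls' l
                else
                  if startNextLine l
                    then yield ls' >> accumulateLines [] l
                    else accumulateLines ls' l
          where
            ls' = ls ++ [nextLine]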
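The grouping rule itself can also be tried out on plain lists, without any conduit machinery. This is only a sketch of the rule as described above (a new group begins at every line for which the predicate holds), not the conduit code itself; `groupByStart` is a hypothetical name and the sketch uses `String` instead of `Text` to stay dependency-free.

```haskell
import Data.List (isPrefixOf)

-- | Split a list into groups. The first element opens a group, and every
-- later element satisfying the predicate opens the next one.
groupByStart :: (a -> Bool) -> [a] -> [[a]]
groupByStart _ [] = []
groupByStart isStart (x : xs) =
  let (rest, remaining) = break isStart xs  -- rest: everything before the next start
   in (x : rest) : groupByStart isStart remaining

-- | A line opens a new log entry if it begins with "commit".
isCommitLineS :: String -> Bool
isCommitLineS = isPrefixOf "commit"
```

For example, `groupByStart isCommitLineS ["commit 12345", "Author: Me", "commit 54321", "Author: Me"]` evaluates to `[["commit 12345","Author: Me"],["commit 54321","Author: Me"]]`.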

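It can be used in a conduit pipeline as shown below. Just pass a Text -> Bool function to the function above; it tells the conduit when the next group of text starts.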


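    -- Imports were not shown; this assumes OverloadedStrings and
    -- TS = Data.Text.Internal.Search (an assumption, not confirmed by the post).
    isCommitLine :: Text -> Bool
    isCommitLine t = listToMaybe (TS.indices "commit" t) == Just 0

    logParser =
      sourceFile "logs.txt"
        .| decodeUtf8
        .| linesUnbounded
        .| groupLines isCommitLine []
        .| Data.Conduit.Combinators.map (intercalate "\n")
        -- do something with each log entry here --
        .| Data.Conduit.Combinators.print

    main :: IO ()
    main = runConduitRes logParser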

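I'm new to Haskell and strongly suspect this is not the best way to accomplish this. So if anyone has a better suggestion, I'd be happy to learn! Otherwise, posting this solution here may help someone else.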
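For the product file from the question, where records are separated by an empty line, a simpler variant may be enough. The following is a sketch under that assumption; `blocksC` and `readBlocks` are hypothetical names, and the `C`-suffixed combinators come from the `Conduit` module of the `conduit` package.

```haskell
import Conduit
import qualified Data.Text as T

-- | Gather consecutive non-blank lines into one block and
-- emit the block whenever a blank line or the end of input is reached.
blocksC :: Monad m => ConduitT T.Text [T.Text] m ()
blocksC = go []
  where
    -- acc holds the lines of the current block in reverse order
    go acc = do
      next <- await
      case next of
        Nothing -> emit acc
        Just l
          | T.null (T.strip l) -> emit acc >> go []
          | otherwise -> go (l : acc)

    -- Empty blocks (e.g. from repeated blank lines) are skipped.
    emit [] = pure ()
    emit acc = yield (reverse acc)

-- | Read a file and return its blank-line-separated blocks.
readBlocks :: FilePath -> IO [[T.Text]]
readBlocks path =
  runConduitRes $
    sourceFile path
      .| decodeUtf8C
      .| linesUnboundedC
      .| blocksC
      .| sinkList  -- demo only; a large file would need a streaming consumer
```

Because each block is yielded as soon as its terminating blank line arrives, only one block needs to be held in memory when `sinkList` is swapped for a streaming consumer.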
