Haskell Turtle: dealing with non-UTF-8 input

haskell character-encoding

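While learning Pipes, I ran into problems when dealing with non-UTF-8 files. That is why I took a detour into the Turtle library, to try to understand how to solve the problem there, at a higher level of abstraction.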

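The exercise I want to do is quite simple: find the total number of lines in all the regular files reachable from a given directory. This is easily accomplished with the following shell command: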

find $FPATH -type f -print | xargs cat | wc -l

I came up with the following solution:

import qualified Control.Foldl as F
import qualified Turtle        as T

-- | Returns true iff the file path is not a symlink.
noSymLink :: T.FilePath -> IO Bool
noSymLink fPath = (not . T.isSymbolicLink) <$> T.stat fPath

-- | Shell that outputs the regular files in the given directory.
regularFilesIn :: T.FilePath -> T.Shell T.FilePath
regularFilesIn fPath = do
  fInFPath <- T.lsif noSymLink fPath
  st <- T.stat fInFPath
  if T.isRegularFile st
    then return fInFPath
    else T.empty

-- | Read lines of `Text` from all the regular files under the given directory
-- path.
inputDir :: T.FilePath -> T.Shell T.Line
inputDir fPath = do
  file <- regularFilesIn fPath
  T.input file

-- | Print the number of lines in all the files in a directory.
printLinesCountIn :: T.FilePath -> IO ()
printLinesCountIn fPath = do
  count <- T.fold (inputDir fPath) F.length
  print count

This solution fails as soon as it encounters a non-UTF-8 file, which is to be expected, since:

$ file -I test/resources/php_ext_syslog.h
test/resources/php_ext_syslog.h: text/x-c; charset=iso-8859-1

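I would like to know how to solve the problem of reading different encodings into `Text`, so that the program can cope with files like this one. For the problem at hand I think I could avoid the conversion to `Text` altogether, but I would rather know how to do it, because you can imagine a situation where, for instance, I want to build a set of all the words under a certain directory.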
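One way to make the word-set case concrete is to decode the raw bytes with a fallback. The sketch below is only an illustration under stated assumptions: it uses the `text`, `bytestring` and `containers` packages, the names `decodeGuess` and `wordSet` are hypothetical, and it assumes that anything that is not valid UTF-8 is ISO-8859-1 (the charset `file` reports above). That is a guess, not a detection.

```haskell
import qualified Data.ByteString    as BS
import qualified Data.Set           as Set
import qualified Data.Text          as Text
import           Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Decode as UTF-8 when the bytes are valid UTF-8; otherwise ASSUME
-- ISO-8859-1. Every byte sequence is valid Latin-1, so the fallback can
-- never fail, but it can silently produce the wrong characters.
decodeGuess :: BS.ByteString -> Text.Text
decodeGuess bytes = either (const (decodeLatin1 bytes)) id (decodeUtf8' bytes)

-- | The set of whitespace-separated words in a blob of bytes.
wordSet :: BS.ByteString -> Set.Set Text.Text
wordSet = Set.fromList . Text.words . decodeGuess
```

With Turtle, the bytes for each file could come from `Turtle.Bytes.input`, collected per file before decoding; whether the Latin-1 fallback is appropriate depends entirely on where the files come from.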

EDIT

So far, the only solution I could come up with is:

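-- Imports assumed by this snippet, in addition to the ones above.
import           Data.ByteString          (ByteString)
import qualified Data.ByteString          as BS
import qualified Data.List.NonEmpty       as NE
import           Data.Text.Encoding       (Decoding (Some), streamDecodeUtf8With)
import           Data.Text.Encoding.Error (lenientDecode)
import qualified Turtle.Bytes             as TB

mDecodeByteString :: T.Shell ByteString -> T.Shell T.Text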
mDecodeByteString = gMDecodeByteString (streamDecodeUtf8With lenientDecode)
  where gMDecodeByteString :: (ByteString -> Decoding)
                             -> T.Shell ByteString
                             -> T.Shell T.Text
        gMDecodeByteString f bss = do
          bs <- bss
          let Some res bs' g = f bs
          if BS.null bs'
            then return res
            else gMDecodeByteString g bss

inputDir' :: T.FilePath -> T.Shell T.Line
inputDir' fPath = do
  file <- regularFilesIn fPath
  text <- mDecodeByteString (TB.input file)
  T.select (NE.toList $ T.textToLines text)

-- | Print the number of lines in all the files in a directory. Using a more
-- robust version of `inputDir`.
printLinesCountIn' :: T.FilePath -> IO ()
printLinesCountIn' fPath = do
  count <- T.fold (inputDir' fPath) T.countLines
  print count
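
When decoding chunk by chunk, the continuation returned by `streamDecodeUtf8With` has to be fed the next chunk, so that a multi-byte character split across a chunk boundary is reassembled. The following is a minimal sketch of that threading over a plain list of chunks, independent of Turtle; `decodeChunks` is a hypothetical name, and the sketch simply drops any incomplete sequence left over at the very end of the input.

```haskell
import qualified Data.ByteString          as BS
import qualified Data.Text                as Text
import           Data.Text.Encoding       (Decoding (Some), streamDecodeUtf8With)
import           Data.Text.Encoding.Error (lenientDecode)

-- | Leniently decode a sequence of UTF-8 chunks, passing the decoder's
-- continuation from one chunk to the next. Invalid bytes are replaced by
-- U+FFFD; a trailing incomplete sequence is discarded (sketch only).
decodeChunks :: [BS.ByteString] -> [Text.Text]
decodeChunks = go (streamDecodeUtf8With lenientDecode)
  where
    go _ [] = []
    go decode (chunk : rest) =
      let Some text _leftover next = decode chunk
      in  text : go next rest
```

Chunk boundaries do not line up with line boundaries, so splitting into lines is safer after the decoded pieces have been joined, for example with `Text.lines . Text.concat`.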

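Counter-question: what is a line? It sounds obvious, but you need to know what it is you are counting. You could treat the whole file as bytes and say that any occurrence of the byte '10' is a new line, and count those. If that is your goal, you are better off reading with `ByteString` rather than `Text`. If you want something more sophisticated, you cannot get around somebody telling you something about what is in the files, because in general you cannot guess what bytes mean just by looking at them.

Indeed, in this case I would probably look for the bytes that correspond to the line terminator and work with `ByteString`s directly. However, I wanted to cover the situation where I want to treat the files as character files (for example, if I am collecting words), and then I need a robust solution for decoding different encodings into `Text`.

Before decoding, you need to know which encoding the file uses. You can guess, but that is brittle. If you are lucky there is a BOM at the start of the file; then it is probably Unicode, and if so you even know which Unicode encoding. But a BOM is not required, and even if you see one it does not necessarily mean the file is encoded in some Unicode encoding. If no file metadata is available, there is no really reliable way to get the file's encoding.

I wonder how `find $FPATH -type f -print | xargs cat | wc -l` handles this…

`wc -l` does not handle it. It just looks for the byte '10' in the file.

"So I won't know the encoding beforehand": see, that is the problem. There is no magic way to know a file's encoding. You can try a few of the most common ones and see whether they give sensible results, but in general you cannot determine whether some arbitrary blob uses some arbitrary encoding. That is why Content-Type headers exist in HTML.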
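
The two byte-level ideas from these comments can be sketched without any decoding at all. This is only an illustration using the `bytestring` package; `countNewlines`, `countNewlinesInFile`, `Bom` and `sniffBom` are hypothetical names. As the comments point out, a BOM is at best a hint: it may be absent, and `FF FE` is also the start of a UTF-32LE BOM, which this sketch does not distinguish.

```haskell
import qualified Data.ByteString as BS

-- | Count line-feed bytes (decimal 10), regardless of encoding.
countNewlines :: BS.ByteString -> Int
countNewlines = BS.count 10

-- | Count line-feed bytes in a file read as raw bytes.
countNewlinesInFile :: FilePath -> IO Int
countNewlinesInFile = fmap countNewlines . BS.readFile

-- | Byte-order marks this sketch recognises.
data Bom = Utf8Bom | Utf16LE | Utf16BE | NoBom
  deriving (Eq, Show)

-- | Inspect the first bytes for a BOM. A hint only, never proof.
sniffBom :: BS.ByteString -> Bom
sniffBom bytes
  | BS.pack [0xEF, 0xBB, 0xBF] `BS.isPrefixOf` bytes = Utf8Bom
  | BS.pack [0xFF, 0xFE]       `BS.isPrefixOf` bytes = Utf16LE
  | BS.pack [0xFE, 0xFF]       `BS.isPrefixOf` bytes = Utf16BE
  | otherwise                                        = NoBom
```

Note that `FilePath` here is the Prelude's `String`-based type, not necessarily the `T.FilePath` used in the Turtle code above.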