Improving speed through ByteString construction tricks (c, performance, haskell, bytestring)
I'm new to Haskell and I've been struggling with efficiency. The task: build a CSV file from a 4 GB text file in which the columns have constant, known widths, e.g. [col1: 4 characters wide, col2: 2 characters wide, etc.].
The file can only contain [A-Z0-9] ASCII characters, so there is no point in escaping cells.
I have:
$ cat example.txt
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 mln lines)
I need:
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...
This is my fastest Haskell code so far; it takes about 2 minutes to process the whole 4 GB file. I need at most 30 seconds:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder
import Data.Monoid
import Data.List

col_sizes = intercalate [1] $ map (`replicate` 0) cs
  where
    cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]

sp = char8 ','  -- column separator
nl = char8 '\n'

separator !cs !cl !xs !xl !ci !xi
  | c == 1    = ps
  | xi == xl  = mempty  -- at the end of the bytestring, end recursion
  | cl == ci  = pr
  | otherwise = pc
  where
    c  = U.unsafeIndex cs ci           -- get column separation indicator
    w  = word8 . U.unsafeIndex xs      -- get char from BS at position
    p  = separator cs cl xs xl         -- partial recursion call
    pr = nl   <> p 0        (xi + 1)   -- end of row: put '\n', reset counter, recur
    ps = sp   <> p (ci + 1) xi         -- end of column: put column separator, recur
    pc = w xi <> p (ci + 1) (xi + 1)   -- in the middle of a column: copy byte, recur

main = do
  contents <- B.getContents
  BL.putStr . toLazyByteString $ init_sep sp_after_char contents

init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
  where l = fromIntegral . B.length

sp_after_char = B.pack col_sizes
Vanilla generator:
time ./data_builder | head -50000000 > /dev/null
./data_builder 0,02s user 1,09s system 30% cpu 3,709 total
head -50000000 > /dev/null 2,95s user 0,76s system 99% cpu 3,708 total
My C solution:
@GabrielGonzalez's Haskell solution
My Haskell solution
@LukeTaylor's solution
I wrote some alternative code that processes the file line by line (since ByteString supports that natively), and it seems to work:
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as B
import Data.Int (Int64)

main = do
  contents <- B.getContents
  mapM B.putStrLn (process contents)

process :: B.ByteString -> [B.ByteString]
process bs = do
  line <- B.lines bs
  return $ B.intercalate "," $ splitLine line indices

splitLine :: B.ByteString -> [Int64] -> [B.ByteString]
splitLine l []     = [l]
splitLine l (i:is) = let (head, tail) = B.splitAt i l
                     in head : splitLine tail is

indices = [1,2,3] :: [Int64]
If I create a file containing 500,000 copies of the string "1223334444", it runs in about a second:
$ time ./blines < blah.txt > blah2.txt
real 0m1.280s
user 0m1.192s
sys 0m0.080s
Does that match what you are trying to achieve?
Update: this is still quite slow for large amounts of data; with 8 million rows it takes about 15 seconds. Using only raw pointer operations, I was able to get within 3x of C:
import Control.Monad (unless, when, void)
import Foreign.Safe hiding (void)
import System.IO
import Foreign.C.Types

bufInSize :: Int
bufInSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + 1

bufOutSize :: Int
bufOutSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + length sizes0

sizes0 :: [Int]
sizes0 = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]

-- I also tried using the C memcpy via the FFI, but got the same speed
memcpy :: Ptr Word8 -> Ptr Word8 -> Int -> IO ()
memcpy dst src n = when (n > 0) $ do
    x <- peek src
    poke dst x
    memcpy (dst `plusPtr` 1) (src `plusPtr` 1) (n - 1)

main =
    allocaArray bufInSize  $ \bufIn0  ->
    allocaArray bufOutSize $ \bufOut0 ->
    with (44 :: Word8) $ \cm -> do
        let loop bufIn bufOut sizes suffixIn suffixOut = do
              let (bytesIn, bytesOut, sizes', copy) = case sizes of
                      []     -> (1, 1    , sizes0, memcpy bufOut bufIn 1)
                      [s]    -> (s, s    , []    , memcpy bufOut bufIn s)
                      s:izes -> (s, s + 1, izes  , do
                          memcpy bufOut bufIn s
                          memcpy (bufOut `plusPtr` s) cm 1 )
              if suffixIn < bytesIn
                  then do
                      eof <- hIsEOF stdin
                      if eof
                          then hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                          else do
                              suffixIn' <- hGetBuf stdin bufIn0 bufInSize
                              loop bufIn0 bufOut sizes suffixIn' suffixOut
                  else if suffixOut < bytesOut
                      then do
                          hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                          loop bufIn bufOut0 sizes suffixIn bufOutSize
                      else do
                          copy
                          loop (bufIn  `plusPtr` bytesIn )
                               (bufOut `plusPtr` bytesOut)
                               sizes'
                               (suffixIn  - bytesIn )
                               (suffixOut - bytesOut)
        loop bufIn0 bufOut0 sizes0 0 bufOutSize
"The internet has already been read and applied; this is my final conclusion. Can you point me to any specific fragment of that chapter I should consider? Please provide test data so that we can benchmark against your C code before submitting answers."

@GabrielGonzalez I think you can build suitable test data from the code: 50 million lines, each 72 characters long, split according to the values in the `cs` array.

Thanks for your answer; I tried this approach and it took 5 minutes to process the data. Interestingly, this code performs about the same when operating on plain Haskell Strings with getLine and putStrLn. I suppose one problem is that splitAt does an unnecessary copy every time, and the more columns there are, the more is wasted. Building the columns in one pass would be better, but I'm not yet sure how to do it :). In theory, splitAt, tail, etc. from ByteString and ByteString.Lazy must incur huge overhead for collecting the data into a ByteString. B.unsafeIndex is the fastest I have found. The only thing I can think of, other than FFI into C, is to look under the hood of B.map or B.filter and use some of the internals (unfortunately, they are hidden).

@KonradKuźnicki ByteString internals are not hidden; in fact they are exposed precisely to let you build your own high-performance functions with them. However, the haddock pages for the Internal modules are not built on Hackage, probably something to do with Safe Haskell.

Try using mapM_ instead of mapM.
time ./tocsvh1 < test.txt > /dev/null
./tocsv < test.txt > /dev/null 19,56s user 0,41s system 100% cpu 19,950 total
time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null
./data_builder 0,11s user 3,04s system 7% cpu 41,320 total
head -50000000 7,29s user 3,56s system 26% cpu 41,319 total
./tocsvh2 > /dev/null 33,01s user 2,42s system 85% cpu 41,327 total
time ./tocsvh2 < test.txt > /dev/null
./tocsvh2 < test.txt > /dev/null 128,63s user 2,95s system 100% cpu 2:11,45 total
time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null
./data_builder 0,02s user 1,26s system 28% cpu 4,526 total
head -50000000 3,17s user 1,33s system 99% cpu 4,524 total
./tocsvh2 > /dev/null 129,95s user 3,33s system 98% cpu 2:14,75 total
time ./tocsvh3 < test.txt > /dev/null
./tocsv < test.txt > /dev/null 324,38s user 4,13s system 100% cpu 5:28,18 total
time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null
./data_builder 0,43s user 4,46s system 1% cpu 5:30,34 total
head -50000000 5,20s user 2,82s system 2% cpu 5:30,34 total
./tocsv > /dev/null 329,08s user 4,21s system 100% cpu 5:32,96 total
$ # The C Version
$ time ./a.out < in.dat > out.dat
real 0m0.189s
user 0m0.116s
sys 0m0.068s
$ # The Haskell version
$ time ./csv < in.dat > out2.dat
real 0m0.536s
user 0m0.428s
sys 0m0.104s
$ diff out.dat out2.dat
$ # No difference