Performance considerations for the Haskell FFI / C?

performance haskell parallel-processing

Performance Haskell FFI/C的性能考虑因素？,performance,haskell,parallel-processing,ffi,Performance,Haskell,Parallel Processing,Ffi,如果将Haskell用作从我的C程序调用的库，调用它对性能有什么影响？例如，如果我有一个有问题的世界数据集，比如20kB的数据，我想运行如下操作： // Go through my 1000 actors and have them make a decision based on // HaskellCode() function, which is compiled Haskell I'm accessing through // the FFI. As an argument, send

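If I use Haskell as a library called from my C program, what is the performance impact of making calls into it? For example, if I have a problem-world data set of, say, 20kB of data, and I want to run something like: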

// Go through my 1000 actors and have them make a decision based on
// HaskellCode() function, which is compiled Haskell I'm accessing through
// the FFI.  As an argument, send in the SAME 20kB of data to EACH of these
// function calls, and some actor specific data
// The 20kB constant data defines the environment and the actor specific
// data could be their personality or state
for(i = 0; i < 1000; i++)
   actor[i].decision = HaskellCode(20kB of data here, actor[i].personality);

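What is going to happen here? Can I keep that 20kB of data as a global immutable reference somewhere that the Haskell code accesses, or must I create a copy of the data on every call?

The concern is that this data could get larger, much larger. I also hope to write algorithms that act on much larger data sets, using the same pattern of immutable data shared by many calls into the Haskell code.

Also, I'd like to parallelize this, like a dispatch_apply() in GCD or Parallel.ForEach(..) in C#. My rationale for parallelizing outside of Haskell is that I know I will always be operating on many separate function calls (i.e. 1000 actors), so fine-grained parallelization inside the Haskell function is no better than managing it at the C level. Is running FFI Haskell instances "thread safe", and how do I achieve this? Do I need to initialize a Haskell instance every time I kick off a parallel run? (That seems slow if I must...) How do I achieve this with good performance?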

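"what is the performance impact of making calls into it"

Assuming you start the Haskell runtime only once, on my machine a function call from C into Haskell, passing an Int back and forth across the boundary, takes about 80,000 cycles (31,000 ns on my Core 2), as determined experimentally via the rdtsc register.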

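"is it possible for me to keep that 20kB of data as a global immutable reference somewhere that is accessed by the Haskell code"

Yes, that is certainly possible. If the data really is immutable, you get the same result whether you:

thread the data back and forth across the language boundary by marshalling;
pass a reference to the data back and forth;
or cache it in an IORef on the Haskell side.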
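For the third option, a minimal sketch might look like the following. The names `worldRef`, `setWorld` and `decideCached` are made up for illustration, and the decision rule is only a placeholder; the point is that C hands the buffer over once and every later call reads it back from the `IORef`.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module WorldCache where

import qualified Data.ByteString as B
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Foreign.C.Types (CChar, CInt (..))
import Foreign.Ptr (Ptr)
import System.IO.Unsafe (unsafePerformIO)

-- Process-wide cache for the constant world data (hypothetical name).
-- NOINLINE keeps GHC from duplicating the unsafePerformIO call.
{-# NOINLINE worldRef #-}
worldRef :: IORef B.ByteString
worldRef = unsafePerformIO (newIORef B.empty)

-- Called once from C: copies the buffer into a Haskell-owned ByteString.
setWorld :: Ptr CChar -> CInt -> IO ()
setWorld buf len = do
  world <- B.packCStringLen (buf, fromIntegral len)
  writeIORef worldRef world

-- Called once per actor. Placeholder rule standing in for the real logic:
-- it just adds the size of the cached data to the personality value.
decideCached :: CInt -> IO CInt
decideCached personality = do
  world <- readIORef worldRef
  return $! fromIntegral (B.length world) + personality

foreign export ccall setWorld :: Ptr CChar -> CInt -> IO ()
foreign export ccall decideCached :: CInt -> IO CInt
```

From C this would be one `setWorld(buf, len)` after `hs_init`, followed by a `decideCached(personality)` per actor, so the 20kB is copied once rather than 1000 times.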

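Which strategy is best? It depends on the data type. The most idiomatic way would be to pass a reference to the C data back and forth, treating it as a ByteString or Vector on the Haskell side.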
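As a rough sketch of the pass-a-reference approach: `unsafePackCStringLen` from `Data.ByteString.Unsafe` wraps an existing C buffer as a `ByteString` without copying it. The function names and the decision rule below are invented for the example, and it assumes the C side keeps the buffer alive and unmodified for the duration of the call.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module WorldView where

import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackCStringLen)
import Foreign.C.Types (CChar, CInt (..))
import Foreign.Ptr (Ptr)

-- Hypothetical pure decision rule: sum of the world bytes (mod 256)
-- plus the actor's personality value.
decide :: B.ByteString -> CInt -> CInt
decide world personality =
  fromIntegral (B.foldl' (\acc w -> acc + fromIntegral w) (0 :: Int) world `mod` 256)
    + personality

-- C passes a pointer and a length. The ByteString is a zero-copy view of
-- that memory, so the result is forced before returning to C.
haskellDecide :: Ptr CChar -> CInt -> CInt -> IO CInt
haskellDecide buf len personality = do
  world <- unsafePackCStringLen (buf, fromIntegral len)
  return $! decide world personality

foreign export ccall haskellDecide :: Ptr CChar -> CInt -> CInt -> IO CInt
```

The per-call cost is then the boundary crossing itself, independent of whether the data is 20kB or much larger.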

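"I'd like to parallelize this"

I'd strongly recommend inverting the control and doing the parallelization from the Haskell runtime. That will be much more robust, as that path has been heavily tested.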

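Regarding thread safety, parallel calls to foreign exported functions running in the same runtime are apparently safe, though I'm fairly sure no one has tried this in order to gain parallelism. Each call in acquires a capability, which is essentially a lock, so multiple calls may block, reducing your chances for parallelism. In the multicore case (e.g. -N4 or so) your results may differ (multiple capabilities are available); however, this is almost certainly a bad way to improve performance.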

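Again, making many parallel function calls from Haskell via forkIO is a better documented, better tested path, with less overhead than doing the work on the C side, and probably less code in the end.

Just make a call into your Haskell function, which in turn will do the parallelism via many Haskell threads. Easy!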
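A minimal sketch of that shape, using only `forkIO` and `MVar` from base: one Haskell thread per actor, with the results collected in order. The types and the `decide` rule are stand-ins rather than anything from the question, and error handling is left out.

```haskell
module ParallelDecide where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Hypothetical stand-ins for the real data types.
type World = [Int]
type Personality = Int
type Decision = Int

-- Placeholder decision rule.
decide :: World -> Personality -> Decision
decide world p = sum world * p

-- Fork one lightweight Haskell thread per actor. Each thread forces its
-- decision and puts it in its own MVar; the caller then takes the results
-- in the original order.
decideAll :: World -> [Personality] -> IO [Decision]
decideAll world ps = do
  slots <- mapM spawn ps
  mapM takeMVar slots
  where
    spawn p = do
      slot <- newEmptyMVar
      _ <- forkIO (evaluate (decide world p) >>= putMVar slot)
      return slot
```

The threads only run on several cores if the program is linked with `-threaded` and the runtime is started with more than one capability (the `-N4` mentioned above); otherwise they are interleaved on a single core.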

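Disclaimer: I have no experience with the FFI.

But it seems to me that if you want to reuse the 20 KB of data so that you aren't passing it every time, you could simply have a function that takes a list of "personalities" and returns a list of "decisions".

So if you have a function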

f :: LotsaData -> Personality -> Decision
f env p = ...

then why not create a helper function

helper :: LotsaData -> [Personality] -> [Decision]
helper env ps = map (f env) ps

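and call that instead? With this approach, though, if you want to parallelize, you would need to do it on the Haskell side with parallel lists and a parallel map.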
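For what it's worth, a parallel version of the helper could be sketched with the `par` and `pseq` primitives that base exposes through `GHC.Conc`. The `parMapList` function below is a hand-rolled stand-in for the `parMap` found in the separate `parallel` package, and the concrete types and the body of `f` are placeholders.

```haskell
module ParHelper where

import GHC.Conc (par, pseq)

-- Hypothetical concrete types so the sketch compiles on its own.
type LotsaData = [Int]
type Personality = Int
type Decision = Int

-- Placeholder for the real decision function.
f :: LotsaData -> Personality -> Decision
f env p = sum env + p

-- Spark the evaluation of each element (to weak head normal form) while
-- the rest of the list is being built.
parMapList :: (a -> b) -> [a] -> [b]
parMapList _ [] = []
parMapList g (x : xs) = y `par` (ys `pseq` (y : ys))
  where
    y = g x
    ys = parMapList g xs

-- Same interface as the sequential helper, but each decision is sparked.
helperPar :: LotsaData -> [Personality] -> [Decision]
helperPar env ps = parMapList (f env) ps
```

The shared `env` is captured once by the partial application `f env`, so nothing is copied per actor.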

I defer to the experts to explain whether and how C arrays can easily be marshalled into Haskell lists (or a similar structure).
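One way this might look, as a sketch only: `peekArray` and `pokeArray` from `Foreign.Marshal.Array` copy between a C array and a Haskell list. The function `decideBatch`, its argument layout, and the `decide` rule are assumptions made for the example.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module BatchDecide where

import Foreign.C.Types (CInt (..))
import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray, withArrayLen)
import Foreign.Ptr (Ptr)

-- Placeholder decision rule.
decide :: CInt -> CInt
decide p = p * 2 + 1

-- C passes an array of n personalities and an output array of the same
-- length. The input is copied into a Haskell list, mapped over, and the
-- decisions are written back into the caller's buffer.
decideBatch :: Ptr CInt -> CInt -> Ptr CInt -> IO ()
decideBatch personalities n decisions = do
  ps <- peekArray (fromIntegral n) personalities
  pokeArray decisions (map decide ps)

foreign export ccall decideBatch :: Ptr CInt -> CInt -> Ptr CInt -> IO ()

-- Exercising the same entry point from Haskell, with temporary C arrays.
roundTrip :: [CInt] -> IO [CInt]
roundTrip ps =
  withArrayLen ps $ \n src ->
    allocaArray n $ \dst -> do
      decideBatch src (fromIntegral n) dst
      peekArray n dst
```

With a batch interface like this, the boundary is crossed once for all 1000 actors instead of once per actor.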

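I use a mix of C and Haskell threads in one of my applications and haven't noticed much of a performance hit when switching between the two. So I put together a simple benchmark... which comes out quite a bit faster/cheaper than Don's. This measures 10 million iterations on a 2.66GHz i7: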

$ ./foo
IO  : 2381952795 nanoseconds total, 238.195279 nanoseconds per, 160000000 value
Pure: 2188546976 nanoseconds total, 218.854698 nanoseconds per, 160000000 value

Compiled with GHC 7.0.3/x86_64 and gcc-4.2.1 on OSX 10.6:

ghc -no-hs-main -lstdc++ -O2 -optc-O2 -o foo ForeignExportCost.hs Driver.cpp

Haskell:

{-# LANGUAGE ForeignFunctionInterface #-}

module ForeignExportCost where

import Foreign.C.Types

foreign export ccall simpleFunction :: CInt -> CInt
simpleFunction i = i * i

foreign export ccall simpleFunctionIO :: CInt -> IO CInt
simpleFunctionIO i = return (i * i)

And an OSX C++ app to drive it, which should be simple to adapt to Windows or Linux:

#include <stdio.h>
#include <mach/mach_time.h>
#include <mach/kern_return.h>
#include <HsFFI.h>
#include "ForeignExportCost_stub.h"

static const int s_loop = 10000000;

int main(int argc, char** argv) {
    hs_init(&argc, &argv);

    struct mach_timebase_info timebase_info = { };
    kern_return_t err;
    err = mach_timebase_info(&timebase_info);
    if (err != KERN_SUCCESS) {
        fprintf(stderr, "error: %x\n", err);
        return err;
    }

    // timing a function in IO
    uint64_t start = mach_absolute_time();
    HsInt32 val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunctionIO(4);
    }

    // in nanoseconds per http://developer.apple.com/library/mac/#qa/qa1398/_index.html
    uint64_t duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    double duration_per = static_cast<double>(duration) / s_loop;
    printf("IO  : %lld nanoseconds total, %f nanoseconds per, %d value\n", duration, duration_per, val);

    // run the loop again with a pure function
    start = mach_absolute_time();
    val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunction(4);
    }

    duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    duration_per = static_cast<double>(duration) / s_loop;
    printf("Pure: %lld nanoseconds total, %f nanoseconds per, %d value\n", duration, duration_per, val);

    hs_exit();
}
