C++ regex: binaries produced by clang++ on Ubuntu 15.10 perform much better than those produced by clang++ on Debian 8 (both v3.4)


I created a test program that measures the performance of std::regex when parsing CSV data:

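#include <chrono>
#include <cmath>      // floor, ceil
#include <cstddef>    // std::size_t
#include <iomanip>
#include <iostream>
#include <numeric>    // std::accumulate
#include <regex>
#include <set>
#include <stdexcept>
#include <string>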

#define DEFAULT_REGEX                                 \
    R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"   \
    R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
    R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])"             \
    R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address;
    std::string command;
    std::string client;
    std::string param;
    std::string value;
    std::string error;
};

void std_regex(std::size_t num, const std::string &str, results_t &res) {
    std::smatch pieces;
    static const std::regex pattern{DEFAULT_REGEX};
    for (auto i = 0u; i < num; i++) {
        bool matched = std::regex_match(str, pieces, pattern);
        if (!(matched && pieces.size() == 7)) {
            throw std::runtime_error("ERROR");
        }
    }
    res.address = pieces[1];
    res.command = pieces[2];
    res.client = pieces[3];
    res.param = pieces[4];
    res.value = pieces[5];
    res.error = pieces[6];
}

std::size_t get_median(const std::multiset<std::size_t> &measured_values) {
    std::size_t i = 0;
    std::size_t median = 0;
    for (auto it = measured_values.cbegin();; it++, i++) {
        double tmp = static_cast<double>(measured_values.size() - 1) / 2.0;
        if (i == floor(tmp)) {
            median = *it;
        }
        if (i == ceil(tmp)) {
            median += *it;
            break;
        }
    }
    return static_cast<std::size_t>(static_cast<double>(median) / 2.0 + 0.5);
}

std::size_t get_avg(const std::multiset<std::size_t> &measured_values) {
    return static_cast<std::size_t>(
        std::accumulate(measured_values.cbegin(), measured_values.cend(),
                        std::size_t{0}) /  // std::size_t init avoids summing in int
            static_cast<double>(measured_values.size()) +
        0.5);
}

int main(void) {
    constexpr std::size_t num = 100000;
    constexpr std::size_t measure_num = 250;
    std::string str = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";

    std::multiset<std::size_t> measured_values;
    results_t res;

    for (std::size_t i = 0; i < measure_num; i++) {
        auto start = std::chrono::steady_clock::now();  // monotonic clock, suited to intervals
        std_regex(num, str, res);
        auto end = std::chrono::steady_clock::now();
        measured_values.insert(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                .count());
    }

    std::cout << *measured_values.cbegin() << ";"           // min
              << *measured_values.crbegin() << ";"          // max
              << get_avg(measured_values) << ";"            // average
              << get_median(measured_values) << std::endl;  // median
}
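As expected, the program shows different timings when it is built with different compilers. For example, performance improves when g++ 5.2 is used instead of g++ 4.9.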

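But this evaluation program also shows an interesting characteristic: if you use clang++-3.4 on Debian 8 instead of on Ubuntu 15.10, it produces a much slower binary. The software was run on the same machine (Intel i7-3770K with 8 GB RAM) both times, and clang++-3.4 was used in both cases.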

The evaluation was executed 250 times; the lines below show the statistics of these measurements.

Here are the measurements on Debian 8 (min; max; average; median, in microseconds):

691244;1160628;713112;700739

Here are the measurements on Ubuntu 15.10 (min; max; average; median, in microseconds):

198484;290986;202656;200637
I would not care if the difference were 10% or 20%, but in this case it is about 350%.


Why is there such a large difference when executing this binary?

The benchmark looks fatally flawed, because you store the samples in a set rather than a multiset.
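
To illustrate why the container choice matters: a std::set silently drops repeated timings, which skews the average and the median, while a std::multiset keeps every sample. A minimal sketch (kept_t and count_kept are hypothetical names introduced only for this illustration, not part of the benchmark):

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper type: number of samples each container retains.
struct kept_t {
    std::size_t in_set;
    std::size_t in_multiset;
};

// Inserts the same timings into both containers and reports their sizes.
kept_t count_kept(const std::vector<std::size_t> &samples) {
    std::set<std::size_t> unique_only(samples.begin(), samples.end());
    std::multiset<std::size_t> all_samples(samples.begin(), samples.end());
    return {unique_only.size(), all_samples.size()};
}
```

Calling count_kept with the timings {200, 200, 200, 900} reports two retained samples for the set but four for the multiset, so statistics computed from the set would be based on only half of the measurements.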

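I will post a fixed version that uses the Nonius micro-benchmarking framework and shows the difference between GCC 5 and Clang 3.6.

Simple comparison (not used; see below):

GCC/libstdc++ output

Clang/libc++ output

Clang/libstdc++ output

Conclusion? Clearly:

Measuring micro-benchmarks like this without noise is hard.
libc++ combined with clang appears to be about 2x slower.
Using clang versus gcc seems to have a smaller impact: there is some difference in the mean, but the variance makes it hard to say whether it is relevant.

Code listing: the Nonius benchmark (the listing that defines NONIUS_BENCHMARK("testcase", ...)) appears after the comments further down.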
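I have run some more benchmarks, elaborating on the earlier test.

I created additional parser implementations:

Spirit Qi v2.x
Spirit X3 (C++14 only, experimental)
A hand-written parser in C++14 style (it could easily be made C++03)

Performance results (the interactive graphs are not reproduced here):

Clearly, the hand-written parser wins regardless of the compiler used.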

Spirit X3 comes in a clear second.

Spirit Qi performs on par with std_regex, except on libc++, where std_regex is very slow.
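
Because these results depend on the standard library rather than on the compiler alone, it helps to confirm which library a given binary was actually built against. A minimal sketch, assuming the library-identification macros _LIBCPP_VERSION (libc++) and __GLIBCXX__ (libstdc++); standard_library_name is a hypothetical helper introduced only for illustration:

```cpp
#include <string>  // any standard header pulls in the library's config macros

// Hypothetical helper: names the C++ standard library this translation unit
// was compiled against, based on the macros each implementation defines.
std::string standard_library_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++ " + std::to_string(_LIBCPP_VERSION);
#elif defined(__GLIBCXX__)
    return "libstdc++ " + std::to_string(__GLIBCXX__);
#else
    return "unknown";
#endif
}
```

Printing this value at the start of a benchmark run makes it visible whether, for example, clang++ was invoked with -stdlib=libc++ or picked up the system's libstdc++.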

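Summary: I would recommend using Spirit or a hand-written parser, because:

The regex is really a maintenance nightmare.
All three alternatives give you more useful results, because escape sequences are actually interpreted, so you do not have to process them again.
The X3 grammar is easy to maintain.

Alternative 1: Spirit X3

If you can afford to use an experimental Boost library that requires C++14, this is my personal favorite. Look at the code and you will see why: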

void spiritX3(const std::string &payload, results_t &res) {

    using namespace boost::spirit::x3;
    auto escaping = [](auto&& set) { return ('\\' >> char_(set)) | (print - char_(set)); };
    auto text     = escaping(";\\");

    symbols<unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    auto address_ = *text;
    auto command_ = raw [ cmds ];
    auto client_  = *text;
    auto param_   = *text;
    auto value_   = *escaping(";:\\"); // note the ':'
    auto error_   = *("'" >> char_(';') >> "'" | escaping(";\\"));

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr)) 
    {
        throw std::runtime_error("ERROR");
    }
}
Sample output

Output from a clang 3.6 build using libc++:

---- parsed with regex:
address: zzz\\bbbb
command: L
client:  babaa
param:   bubu\;cc
value:   vvvv
error:   asd';'ff
---- parsed with manual parser (note: unescaping taken care of):
address: zzz\bbbb
command: L
client:  babaa
param:   bubu;cc
value:   vvvv
error:   asd;ff
---- parsed with spirit Qi (note: unescaping taken care of):
address: zzz\bbbb
command: L
client:  babaa
param:   bubu;cc
value:   vvvv
error:   asd;ff
clock resolution: mean is 16.9379 ns (40960002 iterations)

benchmarking std_regex
collecting 100 samples, 1 iterations each, in estimated 4.968 ms
mean: 15.2716 μs, lb 14.8763 μs, ub 16.1072 μs, ci 0.95
std dev: 2.81028 μs, lb 1668.21 ns, ub 5.63468 μs, ci 0.95
found 1 outliers among 100 samples (1%)
variance is severely inflated by outliers

benchmarking spirit Qi
collecting 100 samples, 7 iterations each, in estimated 1780.1 μs
mean: 2.15209 μs, lb 2.06754 μs, ub 2.22874 μs, ci 0.95
std dev: 412.372 ns, lb 369.921 ns, ub 453.462 ns, ci 0.95
found 0 outliers among 100 samples (0%)
variance is severely inflated by outliers

benchmarking manual
collecting 100 samples, 37 iterations each, in estimated 1705.7 μs
mean: 451.902 ns, lb 448.665 ns, ub 459.504 ns, ci 0.95
std dev: 23.7123 ns, lb 7.42683 ns, ub 41.7546 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers
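
The regex result in the output above still contains the raw escape sequences (zzz\\bbbb, bubu\;cc, asd';'ff), so the regex approach needs a second pass that the other parsers do not. A minimal sketch of such a pass, assuming the escaping rules implied by the pattern (a backslash escapes the next character, and in the error field ';' in single quotes stands for a literal semicolon); unescape is a hypothetical helper, not part of the benchmark:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: removes one level of escaping from a captured field.
// When quoted_semicolon is true (error field), the three characters ';'
// (quote, semicolon, quote) are replaced by a single semicolon.
std::string unescape(const std::string &field, bool quoted_semicolon = false) {
    std::string out;
    out.reserve(field.size());
    for (std::size_t i = 0; i < field.size();) {
        if (quoted_semicolon && field.compare(i, 3, "';'") == 0) {
            out += ';';
            i += 3;
        } else if (field[i] == '\\') {
            if (i + 1 == field.size())
                throw std::runtime_error("dangling backslash");
            out += field[i + 1];  // keep the escaped character, drop the backslash
            i += 2;
        } else {
            out += field[i++];
        }
    }
    return out;
}
```

Applied to the regex captures, unescape(res.param) turns bubu\;cc into bubu;cc, and unescape(res.error, true) turns asd';'ff into asd;ff, matching what the Spirit and hand-written parsers return directly.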

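Comments:

Apart from the other benchmarking problems, you are testing a standard library component, not the compiler. The two machines may have different library versions.

Yes. This is libc++ versus libstdc++. You do not need different machines for that, though.

Mind you, this is what the OP is doing, so perhaps you meant to post your comment on the question?

Oops, I have already corrected the std::set issue in my original post.

Yep, commenting from Android is imprecise :-)

The point is that merely changing g++ to clang++ does not change the library you use.

@byteunit I have updated my answer. Benchmarking has many subtleties, and I heartily recommend that awesome library.

Also, it looks like the regex is wrong. It has... strange parentheses. Did you really mean |\\\\ ... there?

No, the regex is not wrong; the parentheses on the left enclose what has to be captured.

You missed my point. If you are interested, I have an alternative implementation that is more maintainable and faster: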
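Code listing (std::regex benchmark using Nonius):

#include <iostream>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <regex>
#include <stdexcept>  // std::runtime_error
#include <string>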

#define DEFAULT_REGEX                                 \
    R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"   \
    R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
    R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])"             \
    R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address, command, client, param, value, error;
};

static const std::regex pattern{DEFAULT_REGEX};

void std_regex(const std::string &payload, results_t &res) {
    std::smatch pieces;
    bool matched = std::regex_match(payload, pieces, pattern);

    if (!matched || pieces.size() != 7) {
        throw std::runtime_error("ERROR");
    }

    res = { pieces[1], pieces[2], pieces[3], pieces[4], pieces[5], pieces[6] };
}

static std::string const payload = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";

NONIUS_BENCHMARK("testcase", [](/*nonius::chronometer cm*/) {
    results_t res;
    std_regex(payload, res);
})

Alternative 2: Spirit Qi

void spiritQi(const std::string &payload, results_t &res) {

    using namespace boost::spirit::qi;

#define ESCAPING(set) (('\\' >> char_(set)) | (print - char_(set)))
#define TEXT *ESCAPING(";\\")

    symbols<char, unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    using It = std::string::const_iterator;
    rule<It, std::string()> address_ = TEXT;
    rule<It, std::string()> command_ = raw [ cmds ];
    rule<It, std::string()> client_  = TEXT;
    rule<It, std::string()> param_   = TEXT;
    rule<It, std::string()> value_   = *ESCAPING(";:\\"); // note the ':'
    rule<It, std::string()> error_   = *("'" >> char_(';') >> "'" | ESCAPING(";\\"));

    BOOST_SPIRIT_DEBUG_NODES((address_)(command_)(client_)(param_)(value_)(error_))

#undef TEXT
#undef ESCAPING

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr)) 
    {
        throw std::runtime_error("ERROR");
    }
}

Alternative 3: hand-written parser

void manual(const std::string &payload, results_t &res) {

    using It = std::string::const_iterator;
    It       it  = payload.begin();
    It const end = payload.end();

    auto consume = [&](char const* escape_set, std::string& into, auto&& specials) {
        while (it != end)
            if (!specials(into)) switch (*it) {
                case '\\':
                    if (++it != end && strchr(escape_set, *it))
                        into += *it++;
                    else
                        throw "invalid escape";
                    break;
                default:
                    // cast avoids undefined behaviour for negative char values
                    if (isprint(static_cast<unsigned char>(*it)) && !strchr(escape_set, *it))
                        into += *it++;
                    else
                        return true;
            }
        return true;
    };

    auto escaping = [&](char const* escape_set, std::string& into) {
        return consume(escape_set, into, [](std::string&) { return false; });
    };
    auto matched = [&](char const* what) {
        auto saved = it;
        auto wit = what;
        while (*wit) {
            if (it != end && *wit == *it)
                { ++wit; ++it; }
            else {
                it = saved;
                // throw "expected: '" + std::string(what);
                return false;
            }
        }

        return true;
    };

    auto expect = [&](char const* what) {
        if (!matched(what))
            throw "expected: '" + std::string(what);
        return true;
    };

    auto cmd = [&](std::string& into) {
        // Longer commands come first so that "DN" is not consumed as "D" (likewise "LK" vs. "L", etc.)
        static const char *const cmds[] = { "DN", "D", "F", "LK", "LS", "LU", "L", "PK", "PS", "PU", "P", "RK", "RS", "RU", "R", "W" };
        for (auto cmd : cmds)
            if (matched(cmd)) {
                into.assign(cmd);
                return true;
            }
        return false;
    };

    bool ok =  escaping(";\\", res.address) && expect(";")
            && cmd(res.command)                 && expect(";")
            && escaping(";\\",  res.client)     && expect(";")
            && escaping(";\\",  res.param)      && expect(";")
            && escaping(":;\\", res.value);

    auto squoted_semicolon = [&](std::string& into) {
        if (!matched("';'"))
            return false;
        into += ';';
        return true;
    };

    ok &= (it==end) || (expect(";") && consume(";\\", res.error, squoted_semicolon));

    if (!ok)
        throw std::runtime_error("ERROR");
}