C++ regex: binaries built with Clang++ on Ubuntu 15.10 perform much better than those built with Clang++ on Debian 8 (both v3.4)

c++ regex parsing ubuntu debian


I created a test program that measures the performance of std::regex when parsing CSV data:

#include <string>
#include <iostream>
#include <stdexcept>
#include <chrono>
#include <regex>
#include <set>
#include <iomanip>
#include <cmath>    // floor, ceil
#include <numeric>  // std::accumulate

#define DEFAULT_REGEX                                 \
    R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"   \
    R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
    R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])"             \
    R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address;
    std::string command;
    std::string client;
    std::string param;
    std::string value;
    std::string error;
};

void std_regex(std::size_t num, const std::string &str, results_t &res) {
    std::smatch pieces;
    static const std::regex pattern{DEFAULT_REGEX};
    for (auto i = 0u; i < num; i++) {
        bool matched = std::regex_match(str, pieces, pattern);
        if (!(matched && pieces.size() == 7)) {
            throw std::runtime_error("ERROR");
        }
    }
    res.address = pieces[1];
    res.command = pieces[2];
    res.client = pieces[3];
    res.param = pieces[4];
    res.value = pieces[5];
    res.error = pieces[6];
}

std::size_t get_median(const std::multiset<std::size_t> &measured_values) {
    std::size_t i = 0;
    std::size_t median = 0;
    for (auto it = measured_values.cbegin();; it++, i++) {
        double tmp = static_cast<double>(measured_values.size() - 1) / 2.0;
        if (i == floor(tmp)) {
            median = *it;
        }
        if (i == ceil(tmp)) {
            median += *it;
            break;
        }
    }
    return static_cast<std::size_t>(static_cast<double>(median) / 2.0 + 0.5);
}

std::size_t get_avg(const std::multiset<std::size_t> &measured_values) {
    // Accumulate in std::size_t; a plain 0 would make the running sum an int.
    const auto sum = std::accumulate(measured_values.cbegin(),
                                     measured_values.cend(), std::size_t{0});
    return static_cast<std::size_t>(
        sum / static_cast<double>(measured_values.size()) + 0.5);
}

int main(void) {
    constexpr std::size_t num = 100000;
    constexpr std::size_t measure_num = 250;
    std::string str = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";

    std::multiset<std::size_t> measured_values;
    results_t res;

    for (std::size_t i = 0; i < measure_num; i++) {
        auto start = std::chrono::system_clock::now();
        std_regex(num, str, res);
        auto end = std::chrono::system_clock::now();
        measured_values.insert(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                .count());
    }

    std::cout << *measured_values.cbegin() << ";"           // min
              << *measured_values.crbegin() << ";"          /// max
              << get_avg(measured_values) << ";"            // average
              << get_median(measured_values) << std::endl;  // median
}

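As expected, the program reports different timings when it is built with different compilers. For example, performance improves when g++ 5.2 is used instead of g++ 4.9.

But this evaluation program also shows an interesting characteristic: if you use clang++-3.4 on Debian 8 instead of on Ubuntu 15.10, the results are much worse. The software was run on the same machine (Intel i7-3770K, 8 GB RAM) in both cases, and clang++-3.4 was used both times.

The evaluation was executed 250 times; the lines below show the statistics of these measurements.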

Measurements on Debian 8 (min;max;average;median, in microseconds):

691244;1160628;713112;700739

Measurements on Ubuntu 15.10 (min;max;average;median, in microseconds):

198484;290986;202656;200637

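A difference of 10% or 20% would not bother me, but in this case the difference is about 350%.

Why is there such a large difference when executing this binary?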

The benchmark looks fatally flawed, because you store the samples in a set instead of a multiset.

I will post a fixed version that uses the Nonius micro-benchmarking framework and show the difference between GCC 5 and Clang 3.6.

Simple comparison (not used; see below):

GCC/libstdc++ output

Clang/libc++ output

Clang/libstdc++ output

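Conclusions? It's clear:

Measuring micro-benchmarks like this without noise is hard.

libc++ in combination with clang appears to be about 2x slower.

The effect of using clang versus gcc appears to be smaller; there is some difference on average, but the variance makes it hard to say whether it is relevant.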

I have since run more benchmarks that elaborate on these tests.

I created additional parser implementations:

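Spirit Qi v2.x

Spirit X3 (C++14 only, experimental)

A hand-written parser (written in C++14 style, but it could easily be made C++03)

Performance results: [interactive graphs]

Clearly, the hand-written parser is the winner regardless of the compiler used.

Spirit X3 comes in a clear second.

Spirit Qi performs on par with std_regex, except on libc++, where std_regex is very slow.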

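Summary: I recommend using Spirit or a hand-written parser, because:

A regex is really a maintenance nightmare.

All three alternatives give you more useful results, because the escape sequences are actually interpreted, so you don't have to process them again.

The X3 grammar is easy to maintain.

Alternative 1: Spirit X3

If you can afford to use an experimental Boost library that requires C++14, this is my personal favorite. Look at the code and you will see why: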

void spiritX3(const std::string &payload, results_t &res) {

    using namespace boost::spirit::x3;
    auto escaping = [](auto&& set) { return ('\\' >> char_(set)) | (print - char_(set)); };
    auto text     = escaping(";\\");

    symbols<unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    auto address_ = *text;
    auto command_ = raw [ cmds ];
    auto client_  = *text;
    auto param_   = *text;
    auto value_   = *escaping(";:\\"); // note the ':'
    auto error_   = *("'" >> char_(';') >> "'" | escaping(";\\"));

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr)) 
    {
        throw std::runtime_error("ERROR");
    }
}

Sample output, taken from the clang 3.6 configuration with libc++:

---- parsed with regex:
address: zzz\\bbbb
command: L
client:  babaa
param:   bubu\;cc
value:   vvvv
error:   asd';'ff
---- parsed with manual parser (note: unescaping taken care of):
address: zzz\bbbb
command: L
client:  babaa
param:   bubu;cc
value:   vvvv
error:   asd;ff
---- parsed with spirit Qi (note: unescaping taken care of):
address: zzz\bbbb
command: L
client:  babaa
param:   bubu;cc
value:   vvvv
error:   asd;ff
clock resolution: mean is 16.9379 ns (40960002 iterations)

benchmarking std_regex
collecting 100 samples, 1 iterations each, in estimated 4.968 ms
mean: 15.2716 μs, lb 14.8763 μs, ub 16.1072 μs, ci 0.95
std dev: 2.81028 μs, lb 1668.21 ns, ub 5.63468 μs, ci 0.95
found 1 outliers among 100 samples (1%)
variance is severely inflated by outliers

benchmarking spirit Qi
collecting 100 samples, 7 iterations each, in estimated 1780.1 μs
mean: 2.15209 μs, lb 2.06754 μs, ub 2.22874 μs, ci 0.95
std dev: 412.372 ns, lb 369.921 ns, ub 453.462 ns, ci 0.95
found 0 outliers among 100 samples (0%)
variance is severely inflated by outliers

benchmarking manual
collecting 100 samples, 37 iterations each, in estimated 1705.7 μs
mean: 451.902 ns, lb 448.665 ns, ub 459.504 ns, ci 0.95
std dev: 23.7123 ns, lb 7.42683 ns, ub 41.7546 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers

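Aside from the other benchmarking problems, you are testing a standard library component, not the compiler. The two machines probably have different library versions.

Yep. This is libc++ versus libstdc++. No need for different machines, though.

Mind you, this is what the OP was doing, so perhaps you meant to post your comment on the question?

Oops. I have corrected the std::set issue in my original post.

Yep, commenting from Android is imprecise :-) The point is that merely changing g++ to clang++ does not change the library you use.

@byteunit I have updated my answer. Benchmarking has many subtleties, and I heartily recommend the awesome library.

It looks like the error regex is wrong. It has... strange parentheses. Did you really mean the |\\\\ alternatives there?

Reply: no, the regex is not wrong; the left parentheses enclose the content to be captured.

You missed my point. If you are interested, I have an alternative implementation that is more maintainable and faster: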

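#include <ctype.h>   // isprint
#include <string.h>  // strchr
#include <iostream>
#include <regex>
#include <stdexcept>
#include <string>
#include <tuple>     // std::tie

#include <boost/fusion/adapted/std_tuple.hpp>  // lets Spirit treat std::tuple as an attribute
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/qi.hpp>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>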

#define DEFAULT_REGEX                                 \
    R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"   \
    R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)"    \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
    R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])"             \
    R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address, command, client, param, value, error;
};

static const std::regex pattern{DEFAULT_REGEX};

void std_regex(const std::string &payload, results_t &res) {
    std::smatch pieces;
    bool matched = std::regex_match(payload, pieces, pattern);

    if (!matched || pieces.size() != 7) {
        throw std::runtime_error("ERROR");
    }

    res = { pieces[1], pieces[2], pieces[3], pieces[4], pieces[5], pieces[6] };
}

static std::string const payload = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";

NONIUS_BENCHMARK("testcase", [](/*nonius::chronometer cm*/) {
    results_t res;
    std_regex(payload, res);
})

void spiritX3(const std::string &payload, results_t &res) {

    using namespace boost::spirit::x3;
    auto escaping = [](auto&& set) { return ('\\' >> char_(set)) | (print - char_(set)); };
    auto text     = escaping(";\\");

    symbols<unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    auto address_ = *text;
    auto command_ = raw [ cmds ];
    auto client_  = *text;
    auto param_   = *text;
    auto value_   = *escaping(";:\\"); // note the ':'
    auto error_   = *("'" >> char_(';') >> "'" | escaping(";\\"));

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr)) 
    {
        throw std::runtime_error("ERROR");
    }
}

void spiritQi(const std::string &payload, results_t &res) {

    using namespace boost::spirit::qi;

#define ESCAPING(set) (('\\' >> char_(set)) | (print - char_(set)))
#define TEXT *ESCAPING(";\\")

    symbols<char, unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    using It = std::string::const_iterator;
    rule<It, std::string()> address_ = TEXT;
    rule<It, std::string()> command_ = raw [ cmds ];
    rule<It, std::string()> client_  = TEXT;
    rule<It, std::string()> param_   = TEXT;
    rule<It, std::string()> value_   = *ESCAPING(";:\\"); // note the ':'
    rule<It, std::string()> error_   = *("'" >> char_(';') >> "'" | ESCAPING(";\\"));

    BOOST_SPIRIT_DEBUG_NODES((address_)(command_)(client_)(param_)(value_)(error_))

#undef TEXT
#undef ESCAPING

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr)) 
    {
        throw std::runtime_error("ERROR");
    }
}

void manual(const std::string &payload, results_t &res) {

    using It = std::string::const_iterator;
    It       it  = payload.begin();
    It const end = payload.end();

    auto consume = [&](char const* escape_set, std::string& into, auto&& specials) {
        while (it != end)
            if (!specials(into)) switch (*it) {
                case '\\':
                    if (++it != end && strchr(escape_set, *it))
                        into += *it++;
                    else
                        throw "invalid escape";
                    break;
                default:
                    if (isprint(*it) && !strchr(escape_set, *it))
                        into += *it++;
                    else
                        return true;
            }
        return true;
    };

    auto escaping = [&](char const* escape_set, std::string& into) {
        return consume(escape_set, into, [](std::string&) { return false; });
    };
    auto matched = [&](char const* what) {
        auto saved = it;
        auto wit = what;
        while (*wit) {
            if (it != end && *wit == *it)
                { ++wit; ++it; }
            else {
                it = saved;
                // throw "expected: '" + std::string(what);
                return false;
            }
        }

        return true;
    };

    auto expect = [&](char const* what) {
        if (!matched(what))
            throw "expected: '" + std::string(what);
        return true;
    };

    auto cmd = [&](std::string& into) {
        // Longer commands come first so that e.g. "DN" is not cut short by "D".
        static const char *const cmds[] = { "DN", "D", "F", "LK", "LS", "LU", "L", "PK", "PS", "PU", "P", "RK", "RS", "RU", "R", "W" };
        for (auto cmd : cmds)
            if (matched(cmd)) {
                into.assign(cmd);
                return true;
            }
        return false;
    };

    bool ok =  escaping(";\\", res.address) && expect(";")
            && cmd(res.command)                 && expect(";")
            && escaping(";\\",  res.client)     && expect(";")
            && escaping(";\\",  res.param)      && expect(";")
            && escaping(":;\\", res.value);

    auto squoted_semicolon = [&](std::string& into) {
        if (!matched("';'"))
            return false;
        into += ';';
        return true;
    };

    ok &= (it==end) || (expect(";") && consume(";\\", res.error, squoted_semicolon));

    if (!ok)
        throw std::runtime_error("ERROR");
}
