C++ regex: the binary produced by clang++ on Ubuntu 15.10 performs much better than the one produced by clang++ on Debian 8 (both v3.4)
I created a test program that measures the performance of std::regex when parsing CSV data:
#include <string.h>
#include <iostream>
#include <stdexcept>
#include <chrono>
#include <regex>
#include <set>
#include <iomanip>
#include <cmath>    // std::floor, std::ceil
#include <numeric>  // std::accumulate
#include <string>

#define DEFAULT_REGEX \
    R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
    R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
    R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])" \
    R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address;
    std::string command;
    std::string client;
    std::string param;
    std::string value;
    std::string error;
};

// Match str against the pattern num times and store the captures of the
// last match in res.
void std_regex(std::size_t num, const std::string &str, results_t &res) {
    std::smatch pieces;
    static const std::regex pattern{DEFAULT_REGEX};
    for (auto i = 0u; i < num; i++) {
        bool matched = std::regex_match(str, pieces, pattern);
        if (!(matched && pieces.size() == 7)) {
            throw std::runtime_error("ERROR");
        }
    }
    res.address = pieces[1];
    res.command = pieces[2];
    res.client = pieces[3];
    res.param = pieces[4];
    res.value = pieces[5];
    res.error = pieces[6];
}

// Median of a non-empty, sorted container: the middle element, or the
// rounded mean of the two middle elements.
std::size_t get_median(const std::multiset<std::size_t> &measured_values) {
    std::size_t i = 0;
    std::size_t median = 0;
    for (auto it = measured_values.cbegin();; it++, i++) {
        double tmp = static_cast<double>(measured_values.size() - 1) / 2.0;
        if (i == std::floor(tmp)) {
            median = *it;
        }
        if (i == std::ceil(tmp)) {
            median += *it;
            break;
        }
    }
    return static_cast<std::size_t>(static_cast<double>(median) / 2.0 + 0.5);
}

std::size_t get_avg(const std::multiset<std::size_t> &measured_values) {
    // Accumulate in std::size_t; an int initial value would sum in int.
    return static_cast<std::size_t>(
        std::accumulate(measured_values.cbegin(), measured_values.cend(),
                        std::size_t{0}) /
            static_cast<double>(measured_values.size()) +
        0.5);
}

int main(void) {
    constexpr std::size_t num = 100000;
    constexpr std::size_t measure_num = 250;
    std::string str = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";
    std::multiset<std::size_t> measured_values;
    results_t res;
    for (std::size_t i = 0; i < measure_num; i++) {
        auto start = std::chrono::system_clock::now();
        std_regex(num, str, res);
        auto end = std::chrono::system_clock::now();
        measured_values.insert(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                .count());
    }
    std::cout << *measured_values.cbegin() << ";"      // min
              << *measured_values.crbegin() << ";"     // max
              << get_avg(measured_values) << ";"       // average
              << get_median(measured_values) << std::endl;  // median
}
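As expected, the program shows different timings when it is built with different compilers. For example, performance improves when g++ 5.2 is used instead of g++ 4.9.
The benchmark also revealed something interesting: clang++-3.4 on Debian 8 produces a much slower binary than clang++-3.4 on Ubuntu 15.10. The program was run on the same machine both times (Intel i7-3770K, 8 GB RAM), and clang++-3.4 was used in both cases.
The measurement was repeated 250 times; the lines below show the statistics for these runs.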
Measurements on Debian 8 (min;max;average;median, in microseconds):
691244;1160628;713112;700739
Measurements on Ubuntu 15.10 (min;max;average;median, in microseconds):
198484;290986;202656;200637
A difference of 10% or 20% would not bother me, but in this case the difference is roughly 350%.
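Why is there such a large difference when running this binary?

The benchmark looks fatally flawed because you store the samples in a set, not a multiset, so identical timings are silently dropped.

I will post a fixed version that uses the Nonius micro-benchmark framework and shows the differences between GCC 5 and Clang 3.6.

Quick comparison:
- Not used: see below
- GCC/libstdc++ output
- Clang/libc++ output
- Clang/libstdc++ output

Conclusion? Clearly:
- Measuring micro-benchmarks like this without noise is hard.
- libc++ combined with clang seems to be about 2x slower.
- The effect of using clang versus gcc seems smaller; there is some difference on average, but the variance makes it hard to say whether it is relevant.

The code listing for the Nonius version appears after the comments below.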
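The set-versus-multiset point can be illustrated in isolation. The following is a minimal sketch, not part of the original benchmark; the helper names `count_stored` and `example_samples` and the sample values are made up for illustration. It shows that a std::set keeps only one copy of each distinct timing, which skews the average and the median, while a std::multiset keeps every sample.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: insert all samples into a container of type C
// and report how many elements it ended up storing.
template <typename C>
std::size_t count_stored(const std::vector<std::size_t>& samples) {
    C container;
    for (std::size_t s : samples) {
        container.insert(s);
    }
    return container.size();
}

// Made-up timings in microseconds; the value 700 occurs three times.
inline std::vector<std::size_t> example_samples() {
    return {700, 700, 700, 710, 1200};
}
```

With these five samples, `count_stored<std::set<std::size_t>>(example_samples())` stores 3 elements, while the `std::multiset` variant stores all 5.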
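I have done more benchmarking, expanding on the earlier tests.

I created additional parser implementations using:
- Spirit Qi v2.x
- Spirit X3 (C++14 only, experimental)
- a hand-written parser (written in C++14 style, but it could easily be made C++03)

Performance results: [six interactive graphs, not reproduced here]

Clearly, the hand-written parser wins regardless of the compiler used. Spirit X3 comes in a clear second. Spirit Qi matches the performance of std_regex closely, except on libc++, where std_regex is very slow.

Summary: I recommend using Spirit or a hand-written parser, because:
- The regex is really a maintenance nightmare.
- All three alternatives give you more useful results, because the escape sequences are actually interpreted, so you do not have to process them again.
- The X3 grammar is easy to maintain.

Alternative 1: Spirit X3

If you can afford to use an experimental Boost library that requires C++14, this is my personal favourite. Look at the code and you will see why: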
// Assumed headers: <boost/spirit/home/x3.hpp>,
// <boost/fusion/adapted/std_tuple.hpp> and <tuple>.
void spiritX3(const std::string &payload, results_t &res) {
    using namespace boost::spirit::x3;

    // Either a backslash followed by a character from set, or any printable
    // character that is not in set.
    auto escaping = [](auto&& set) { return ('\\' >> char_(set)) | (print - char_(set)); };
    auto text = escaping(";\\");

    symbols<unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    auto address_ = *text;
    auto command_ = raw [ cmds ];
    auto client_  = *text;
    auto param_   = *text;
    auto value_   = *escaping(";:\\"); // note the ':'
    auto error_   = *("'" >> char_(';') >> "'" | escaping(";\\"));

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);
    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr))
    {
        throw std::runtime_error("ERROR");
    }
}
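For comparison, the std::regex version returns each field exactly as it appears in the input (for example `bubu\;cc`), so a caller would still need a separate pass to interpret the escapes. A minimal sketch of such a pass follows; the function name `unescape_field` is hypothetical, and it assumes the simple rule suggested by the regex above: a backslash makes the following character literal.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical post-processing step for fields captured by std::regex.
// A backslash is dropped and the character after it is copied verbatim,
// so backslash-semicolon becomes a semicolon and a doubled backslash
// becomes a single one.
std::string unescape_field(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\\') {
            // Skip the backslash; fail if nothing follows it.
            if (++i == raw.size()) {
                throw std::runtime_error("dangling backslash");
            }
        }
        out += raw[i];
    }
    return out;
}
```

The Spirit and hand-written parsers make this step unnecessary because they drop the backslash while parsing.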
Sample output

Output for the clang 3.6 with libc++ configuration:
---- parsed with regex:
address: zzz\\bbbb
command: L
client: babaa
param: bubu\;cc
value: vvvv
error: asd';'ff
---- parsed with manual parser (note: unescaping taken care of):
address: zzz\bbbb
command: L
client: babaa
param: bubu;cc
value: vvvv
error: asd;ff
---- parsed with spirit Qi (note: unescaping taken care of):
address: zzz\bbbb
command: L
client: babaa
param: bubu;cc
value: vvvv
error: asd;ff
clock resolution: mean is 16.9379 ns (40960002 iterations)
benchmarking std_regex
collecting 100 samples, 1 iterations each, in estimated 4.968 ms
mean: 15.2716 μs, lb 14.8763 μs, ub 16.1072 μs, ci 0.95
std dev: 2.81028 μs, lb 1668.21 ns, ub 5.63468 μs, ci 0.95
found 1 outliers among 100 samples (1%)
variance is severely inflated by outliers
benchmarking spirit Qi
collecting 100 samples, 7 iterations each, in estimated 1780.1 μs
mean: 2.15209 μs, lb 2.06754 μs, ub 2.22874 μs, ci 0.95
std dev: 412.372 ns, lb 369.921 ns, ub 453.462 ns, ci 0.95
found 0 outliers among 100 samples (0%)
variance is severely inflated by outliers
benchmarking manual
collecting 100 samples, 37 iterations each, in estimated 1705.7 μs
mean: 451.902 ns, lb 448.665 ns, ub 459.504 ns, ci 0.95
std dev: 23.7123 ns, lb 7.42683 ns, ub 41.7546 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers
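Because figures like these depend on the standard library the binary was built against (libc++ in this run), it can help to have a benchmark binary report that itself. A minimal sketch, assuming the usual vendor macros `_LIBCPP_VERSION` (libc++) and `__GLIBCXX__` (libstdc++), which become visible once any standard header is included; the function name `stdlib_name` is hypothetical.

```cpp
#include <cstddef>  // any standard header makes the vendor macros visible
#include <string>

// Report which C++ standard library implementation this translation unit
// was compiled against. The choice of compiler alone does not decide this.
std::string stdlib_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++ " + std::to_string(_LIBCPP_VERSION);
#elif defined(__GLIBCXX__)
    return "libstdc++ " + std::to_string(__GLIBCXX__);
#else
    return "unknown";
#endif
}
```

With clang++, the library can be selected explicitly with `-stdlib=libc++` or `-stdlib=libstdc++`, so printing `stdlib_name()` next to the timings makes it clear which combination was measured.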
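Comments:

Apart from the other benchmarking issues, you are testing a standard library component, not the compiler. The two machines probably have different library versions.

Yes. This is libc++ versus libstdc++. You do not need different machines for that, though.

Mind you, that is what the OP is doing, so perhaps you meant to post your comment on the question?

Oops, I have already corrected the std::set issue in my original post.

Yep, commenting from Android is imprecise :-) The point is that merely changing g++ to clang++ does not change the library you use.

@byteunit I have updated my answer. Benchmarking has many subtleties; I heartily recommend the awesome library for it.

It looks like the regex for the error field is wrong. It has... odd parentheses. Did you really mean |\\\\... there?

Reply: no, the regex is not wrong; the left parenthesis encloses the content to be captured.

You missed my point. If you are interested, I have an alternative implementation that is more maintainable and faster: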
#include <iostream>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <regex>
#include <stdexcept>
#include <string>

#define DEFAULT_REGEX \
    R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
    R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
    R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
    R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])" \
    R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address, command, client, param, value, error;
};

// Compiled once, outside the measured code.
static const std::regex pattern{DEFAULT_REGEX};

void std_regex(const std::string &payload, results_t &res) {
    std::smatch pieces;
    bool matched = std::regex_match(payload, pieces, pattern);
    if (!matched || pieces.size() != 7) {
        throw std::runtime_error("ERROR");
    }
    res = { pieces[1], pieces[2], pieces[3], pieces[4], pieces[5], pieces[6] };
}

static std::string const payload = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";

NONIUS_BENCHMARK("testcase", [](/*nonius::chronometer cm*/) {
    results_t res;
    std_regex(payload, res);
})
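As a rough illustration of what a framework such as Nonius automates, the sketch below collects several samples, runs multiple iterations per sample so that each sample is long compared with the clock resolution, and reports the mean and standard deviation. It is a simplified stand-in, not the algorithm Nonius uses (there is no bootstrapping or outlier analysis); the names `bench_stats` and `run_benchmark` are hypothetical.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct bench_stats {
    double mean_ns;    // mean time per iteration, in nanoseconds
    double stddev_ns;  // sample standard deviation across samples
};

// Time fn in `samples` batches of `iterations` calls each.
// Requires samples >= 2 and iterations >= 1.
template <typename Fn>
bench_stats run_benchmark(Fn&& fn, std::size_t samples, std::size_t iterations) {
    using clock = std::chrono::steady_clock;  // monotonic, unlike system_clock
    std::vector<double> per_iter;
    per_iter.reserve(samples);
    for (std::size_t s = 0; s < samples; ++s) {
        const auto start = clock::now();
        for (std::size_t i = 0; i < iterations; ++i) {
            fn();
        }
        const std::chrono::duration<double, std::nano> elapsed = clock::now() - start;
        per_iter.push_back(elapsed.count() / static_cast<double>(iterations));
    }
    double sum = 0.0;
    for (double v : per_iter) sum += v;
    const double mean = sum / static_cast<double>(per_iter.size());
    double sq = 0.0;
    for (double v : per_iter) sq += (v - mean) * (v - mean);
    const double stddev = std::sqrt(sq / static_cast<double>(per_iter.size() - 1));
    return {mean, stddev};
}
```

A call such as `run_benchmark([&] { results_t res; std_regex(payload, res); }, 100, 10)` would time the regex version. As the sample output shows, a real framework also estimates the clock resolution and chooses the iteration count per benchmark.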
// Assumed headers: <boost/spirit/include/qi.hpp>,
// <boost/fusion/adapted/std_tuple.hpp> and <tuple>.
void spiritQi(const std::string &payload, results_t &res) {
    using namespace boost::spirit::qi;

    #define ESCAPING(set) (('\\' >> char_(set)) | (print - char_(set)))
    #define TEXT *ESCAPING(";\\")

    symbols<char, unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    using It = std::string::const_iterator;
    rule<It, std::string()> address_ = TEXT;
    rule<It, std::string()> command_ = raw [ cmds ];
    rule<It, std::string()> client_  = TEXT;
    rule<It, std::string()> param_   = TEXT;
    rule<It, std::string()> value_   = *ESCAPING(";:\\"); // note the ':'
    rule<It, std::string()> error_   = *("'" >> char_(';') >> "'" | ESCAPING(";\\"));

    BOOST_SPIRIT_DEBUG_NODES((address_)(command_)(client_)(param_)(value_)(error_))

    #undef TEXT
    #undef ESCAPING

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);
    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr))
    {
        throw std::runtime_error("ERROR");
    }
}
// Requires <cctype> (isprint), <cstring> (strchr), <stdexcept> and <string>.
void manual(const std::string &payload, results_t &res) {
    using It = std::string::const_iterator;
    It it = payload.begin();
    It const end = payload.end();

    // Append characters to `into` until an unescaped character from
    // escape_set (or a non-printable character) is reached. `specials` may
    // consume input itself and returns true when it did.
    auto consume = [&](char const* escape_set, std::string& into, auto&& specials) {
        while (it != end)
            if (!specials(into)) switch (*it) {
                case '\\':
                    if (++it != end && strchr(escape_set, *it))
                        into += *it++;
                    else
                        throw "invalid escape";
                    break;
                default:
                    // The cast avoids undefined behaviour for negative char values.
                    if (isprint(static_cast<unsigned char>(*it)) && !strchr(escape_set, *it))
                        into += *it++;
                    else
                        return true;
            }
        return true;
    };

    auto escaping = [&](char const* escape_set, std::string& into) {
        return consume(escape_set, into, [](std::string&) { return false; });
    };

    // Try to match the literal `what` at the current position; on failure the
    // position is restored.
    auto matched = [&](char const* what) {
        auto saved = it;
        auto wit = what;
        while (*wit) {
            if (it != end && *wit == *it)
                { ++wit; ++it; }
            else {
                it = saved;
                // throw "expected: '" + std::string(what);
                return false;
            }
        }
        return true;
    };

    auto expect = [&](char const* what) {
        if (!matched(what))
            throw "expected: '" + std::string(what);
        return true;
    };

    auto cmd = [&](std::string& into) {
        // The first match wins, so longer commands must come before their own
        // prefixes ("DN" before "D"); otherwise "DN" would be read as "D" and
        // the following expect(";") would fail.
        static const char *const cmds[] = { "DN", "D", "F", "LK", "LS", "LU", "L", "PK", "PS", "PU", "P", "RK", "RS", "RU", "R", "W" };
        for (auto candidate : cmds)
            if (matched(candidate)) {
                into.assign(candidate);
                return true;
            }
        return false;
    };

    bool ok = escaping(";\\", res.address) && expect(";")
           && cmd(res.command)             && expect(";")
           && escaping(";\\", res.client)  && expect(";")
           && escaping(";\\", res.param)   && expect(";")
           && escaping(":;\\", res.value);

    // In the error field a single-quoted semicolon (';') stands for ';'.
    auto squoted_semicolon = [&](std::string& into) {
        if (!matched("';'"))
            return false;
        into += ';';
        return true;
    };

    // The error field is optional.
    ok &= (it==end) || (expect(";") && consume(";\\", res.error, squoted_semicolon));

    if (!ok)
        throw std::runtime_error("ERROR");
}
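One detail worth checking in a hand-written matcher like `cmd` above: when keywords are tried one after another and the first hit wins, a keyword that is a prefix of another (for example D and DN) has to be tried after the longer one, otherwise DN can never be recognised. A minimal standalone sketch of the longest-first ordering follows; the function name `match_command` is hypothetical, and it works on a plain std::string rather than the iterators used above.

```cpp
#include <cstddef>
#include <string>

// Return the command keyword at the start of `input`, or an empty string if
// none matches. Longer keywords are listed before their own prefixes.
std::string match_command(const std::string& input) {
    static const char* const cmds[] = {
        "DN", "D", "F",
        "LK", "LS", "LU", "L",
        "PK", "PS", "PU", "P",
        "RK", "RS", "RU", "R",
        "W"};
    for (const char* cmd : cmds) {
        const std::string keyword(cmd);
        // Compare the first keyword.size() characters of input with keyword.
        if (input.compare(0, keyword.size(), keyword) == 0) {
            return keyword;
        }
    }
    return std::string();
}
```

Spirit's `symbols` parser and the regex alternation do not need this manual ordering, which is one more reason the grammar-based versions are easier to maintain.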