C++: Tokenize a sentence into words, taking special characters into account


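I have a function that takes a sentence and tokenizes it into words, splitting on the space character " ". Now I want to improve the function so that it also strips certain special characters, for example: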

I am a boy.   => {I, am, a, boy}, no period after "boy"
I said :"are you ok?"  => {I, said, are, you, ok}, no question and quotation mark

The original function is below. How can I improve it?

void Tokenize(const string& str, vector<string>& tokens, const string& delimiters = " ")
{

    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {

        tokens.push_back(str.substr(lastPos, pos - lastPos));

        lastPos = str.find_first_not_of(delimiters, pos);

        pos = str.find_first_of(delimiters, lastPos);
    }
}

You can use std::regex. With it you can search for whatever you want and put the results in a vector. It is fairly simple.

See the example program below.

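Copy str into str2 while removing the special characters, then do all of the operations on str2.

If you want to improve the function, you first need to define what you consider better. Then I suggest writing tests, both for the cases that already work and for the ones that do not work as well as you would like. Then try to improve the function, and if you have a specific question about that, ask it here. As it stands, it looks like you are just looking for someone to write it for you.

For all string manipulation I would recommend Boost.Spirit over manual manipulation and index arithmetic ... but it is a sledgehammer and hard to learn [link]
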
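The first comment's idea (copy str into str2 while dropping the special characters, then work on str2) might look roughly like the sketch below. The helper names StripPunctuation and TokenizeStripped are hypothetical, and using std::ispunct to decide what counts as a special character is an assumption made for illustration, not part of the original suggestion.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: copy str into a new string (str2), skipping every
// character that std::ispunct classifies as punctuation.
std::string StripPunctuation(const std::string& str)
{
    std::string str2;
    str2.reserve(str.size());
    std::remove_copy_if(str.begin(), str.end(), std::back_inserter(str2),
                        [](unsigned char c) { return std::ispunct(c) != 0; });
    return str2;
}

// Hypothetical helper: strip punctuation first, then split on whitespace
// by reading words from a string stream.
std::vector<std::string> TokenizeStripped(const std::string& str)
{
    std::istringstream in(StripPunctuation(str));
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

One caveat with this approach: because the punctuation is deleted rather than replaced, two words separated only by a special character (for example "boy.Next") are merged into one token. Replacing each special character with a space instead of dropping it would avoid that.
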
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include <vector>
#include <regex>

// Our test data (raw string). So, containing also \" and so on
std::string testData(R"#(I said :"are you ok?")#");

// Capture group 1 matches one word; the optional ",?" also keeps a trailing comma attached to it
std::regex re(R"#((\b\w+\b,?))#");

int main(void)
{
    // Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
    std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };

    // For debug output. Print complete vector to std::cout
    std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, " "));

    return 0;
}
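
As an alternative to the regex version above, the Tokenize function from the question already accepts a delimiter string, so it may be enough to pass a wider set of delimiters. The sketch below reuses that function unchanged (apart from std:: qualification). The wrapper name TokenizeWords and the exact delimiter set are assumptions for illustration.

```cpp
#include <string>
#include <vector>

// The Tokenize function from the question, with std:: qualification added.
void Tokenize(const std::string& str, std::vector<std::string>& tokens,
              const std::string& delimiters = " ")
{
    // Skip leading delimiters, then find the end of the first token.
    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip the run of delimiters, then find the end of the next token.
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
}

// Hypothetical wrapper: treats whitespace and common punctuation as delimiters.
std::vector<std::string> TokenizeWords(const std::string& sentence)
{
    // Assumed delimiter set; extend it with whatever else should be stripped.
    static const std::string kDelimiters = " \t\n.,:;?!\"()";
    std::vector<std::string> tokens;
    Tokenize(sentence, tokens, kDelimiters);
    return tokens;
}
```

Because a run of consecutive delimiters is skipped as a whole, a sequence such as a space followed by :" does not produce empty tokens. Unlike deleting punctuation up front, this also splits words that are separated only by a special character.
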