Iterating over a text file in C++ using a Fortran-like format

Tags: c++, parsing, fortran, text-parsing

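I am making an application that deals with txt file data.

The idea is that the txt files may come in different formats, and they should be read into C++. One example might be 3I2,3X,I3, which should be read as: "first we have 3 integers of length 2, then we have 3 empty spots, then we have 1 integer of length 3".

Is it best to iterate over the file, yielding lines, and then iterate over each line as a string? What would be an efficient way of iterating that neatly leaves out the 3 spots to be ignored?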

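For example:

to:

You can translate 3I2, 3X, I3 into a scanf format.

The link given by Kyle Kanos is a good one; *scanf/*printf format strings map pretty well onto Fortran format strings. It is actually easier to do this with C-style I/O, but using C++-style streams is doable as well: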

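#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream fortranfile("input.txt");
    if (!fortranfile.is_open()) {
        std::cerr << "Could not open input.txt" << std::endl;
        return 1;
    }

    std::string line;
    // Test getline itself so that a last line without a trailing newline is still read
    while (std::getline(fortranfile, line)) {
        char dummy[4];  // 3 skipped characters plus the terminating '\0'
        int i1, i2, i3, i4;

        // sscanf returns the number of fields it converted; all 5 must succeed
        if (std::sscanf(line.c_str(), "%2d%2d%2d%3s%3d", &i1, &i2, &i3, dummy, &i4) != 5) {
            std::cerr << "Could not parse line: '" << line << "'" << std::endl;
            continue;
        }

        std::cout << "Line: '" << line << "' -> " << i1 << " " << i2 << " "
                  << i3 << " " << i4 << std::endl;
    }

    return 0;
}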

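The format string we use here is %2d%2d%2d%3s%3d: three copies of %2d (a decimal integer of width 2), followed by %3s (a string of width 3, which we read into a variable that is never used), followed by %3d (a decimal integer of width 3). Note that %3s skips leading whitespace, so this works when the three ignored columns hold non-blank characters; if they are blanks, %*3c, which reads and discards exactly three characters, is the safer choice.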
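The mapping from a Fortran format to a scanf format can also be done mechanically. The following is only a minimal sketch under stated assumptions: the helper name fortranToScanf is hypothetical, it handles just the I and X descriptors from the question, and it emits %*Nc for skipped columns rather than the %3s used above. Because %d skips leading whitespace, the result is not a column-exact equivalent of a Fortran READ.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answers above): turns a Fortran-style
// format such as "3I2,3X,I3" into a scanf format string. Only the I (integer)
// and X (skip) descriptors are handled; anything else yields an empty string.
std::string fortranToScanf(const std::string& fmt) {
    std::string out;
    std::size_t i = 0;
    while (i < fmt.size()) {
        const unsigned char ch = static_cast<unsigned char>(fmt[i]);
        if (ch == ',' || std::isspace(ch)) {  // separators between descriptors
            ++i;
            continue;
        }

        // Optional repeat count in front of the descriptor letter (default 1).
        std::size_t repeat = 0;
        bool hasRepeat = false;
        while (i < fmt.size() && std::isdigit(static_cast<unsigned char>(fmt[i]))) {
            repeat = repeat * 10 + static_cast<std::size_t>(fmt[i] - '0');
            hasRepeat = true;
            ++i;
        }
        if (!hasRepeat) repeat = 1;
        if (i >= fmt.size()) return std::string();  // a count with no descriptor

        const char type =
            static_cast<char>(std::toupper(static_cast<unsigned char>(fmt[i])));
        ++i;

        // Optional field width after the descriptor letter.
        std::string width;
        while (i < fmt.size() && std::isdigit(static_cast<unsigned char>(fmt[i]))) {
            width += fmt[i];
            ++i;
        }

        if (type == 'I' && !width.empty()) {
            for (std::size_t r = 0; r < repeat; ++r) out += "%" + width + "d";
        } else if (type == 'X' && width.empty() && repeat > 0) {
            // %*Nc reads exactly N characters and discards them.
            out += "%*" + std::to_string(repeat) + "c";
        } else {
            return std::string();  // unsupported or malformed descriptor
        }
    }
    return out;
}
```

The returned string can be passed to sscanf together with one int pointer per I field; the X fields consume no argument because of the assignment-suppressing *.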

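Given that Fortran is easily callable from C, you could write a small Fortran function to do this "natively". After all, the Fortran READ statement takes format strings exactly as you describe.

If you want this to work, you will need to brush up on your Fortran a little and learn how to link Fortran and C++ with your compiler. A few tips:

  • Fortran symbols may be implicitly suffixed with an underscore, so MYFUNC may be callable from C as MYFUNC_().
  • Multi-dimensional arrays have the opposite ordering of dimensions.
  • Declaring a Fortran (or C) function in a C++ header requires placing it in an extern "C" {} scope.

If your users are actually supposed to enter the data in Fortran format, or if you can quickly adapt or write Fortran code to do this, I would do as John Zwinck and M.S.B. suggest. Just write a short Fortran routine that reads the data into an array, and use bind(c) and the ISO_C_BINDING types to set up the interface. Remember that array indexing changes between Fortran and C++.

Otherwise, I would recommend using scanf, as mentioned above: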

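If you do not know the number of items per line you need to read, you might be able to use vscanf:

However, although it looks convenient, I have never used it, so YMMV.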
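For reference, here is a minimal sketch of how the v-family of scanf functions is typically used. It relies on vsscanf, the sibling of vscanf that reads from a string instead of stdin, since the other examples in this thread parse a line that has already been read. The wrapper name scanLine is an assumption made for illustration; the caller still has to pass one pointer per converted field.

```cpp
#include <cstdarg>
#include <cstdio>

// Hypothetical wrapper: forwards a caller-chosen format and a variable number
// of destination pointers to vsscanf, and returns the number of fields that
// were converted and stored.
int scanLine(const char* line, const char* format, ...) {
    va_list args;
    va_start(args, format);
    const int converted = std::vsscanf(line, format, args);
    va_end(args);
    return converted;
}
```

Called as scanLine(text, "%2d%2d%2d%*3c%3d", &a, &b, &c, &d), this returns the number of integers stored, so the caller can detect short or malformed lines. Fields suppressed with * are not counted.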

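Given what you want to do, you should note: you have immediately walked into the realm of parsers.

In addition to the other ways of parsing such input that others have noted here:

  • using Fortran and C/C++ bindings to do the parsing for you
  • using pure C++ to parse it for you, with a combination of:
    • sscanf

my proposal is that, if Boost is available to you, you can use it to implement a simple parser for on-the-fly operations, using a combination of regexes and STL containers.

From what you have described, and from what is shown in different places, you can construct a naive implementation of the grammar you wish to support using regex captures:

(\\d{0,8})([[:alpha:]])(\\d{0,8})

  • The first group is the number of times that variable type is repeated.
  • The second is the type of the variable.
  • The third is the length of the variable type.

Using Boost, you can implement a naive solution as follows: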

    #include <iostream>
    #include <string>
    #include <vector>
    #include <fstream>
    #include <cstdlib>
    #include <boost/regex.hpp>
    #include <boost/tokenizer.hpp>
    #include <boost/algorithm/string.hpp>
    #include <boost/lexical_cast.hpp>
    
    //A POD Data Structure used for storing Fortran Format Tokens into their relative forms
    typedef struct FortranFormatSpecifier {
        char type;//the type of the variable
        size_t number;//the number of times the variable is repeated
        size_t length;//the length of the variable type
    } FFlag;
    
    //This class implements a rudimentary parser to parse Fortran Format
    //Specifier Flags using Boost regexes.
    class FormatParser {
    public:
        //typedefs for further use with the class and class methods
        typedef boost::tokenizer<boost::char_separator<char> > bst_tokenizer;
        typedef std::vector<std::vector<std::string> > vvstr;
        typedef std::vector<std::string> vstr;
        typedef std::vector<std::vector<int> > vvint;
        typedef std::vector<int> vint;
    
        FormatParser();
        FormatParser(const std::string& fmt, const std::string& fname);
    
        void parse();
        void printIntData();
        void printCharData();
    
    private:
        bool validateFmtString();
        size_t determineOccurence(const std::string& numStr);
        FFlag setFortranFmtArgs(const boost::smatch& matches);
        void parseAndStore(const std::string& line);
        void storeData();
    
        std::string mFmtStr;                //this holds the format string
        std::string mFilename;              //the name of the file
    
        FFlag mFmt;                         //a temporary FFlag variable
        std::vector<FFlag> mFortranVars;    //this holds all the flags and details of them
        std::vector<std::string> mRawData;  //this holds the raw tokens
    
        //this is where you will hold all the types of data you wish to support
        vvint mIntData;                     //this holds all the int data
        vvstr mCharData;                    //this holds all the character data (stored as strings for convenience)
    };
    
    FormatParser::FormatParser() : mFmtStr(), mFilename(), mFmt(), mFortranVars(), mRawData(), mIntData(), mCharData() {}
    FormatParser::FormatParser(const std::string& fmt, const std::string& fname) : mFmtStr(fmt), mFilename(fname), mFmt(), mFortranVars(), mRawData(), mIntData(), mCharData() {}
    
    //this function determines the number of times that a variable occurs
    //by parsing a numeric string and returning the associated output
    //based on the grammar
    size_t FormatParser::determineOccurence(const std::string& numStr) {
        size_t num = 0;
        //this case means that no number was supplied in front of the type
        if (numStr.empty()) {
            num = 1;//hence, the default is 1
        }
        else {
            //attempt to parse the numeric string and find its equivalent
            //integer value (since all occurrences are whole numbers)
            size_t n = atoi(numStr.c_str());
    
            //this case covers a numeric string that is explicitly 0;
            //logically, the item does not occur, so set the value accordingly
            if (n == 0) {
                num = 0;
            }
            else {
                //set the value to its converted representation
                num = n;
            }
        }
        return num;
    }
    
    //from the boost::smatches, determine the set flags, store them
    //and return it
    FFlag FormatParser::setFortranFmtArgs(const boost::smatch& matches) {
        FFlag ffs = {0};
    
        std::string fmt_number, fmt_type, fmt_length;
    
        fmt_number = matches[1];
        fmt_type = matches[2];
        fmt_length = matches[3];
    
        ffs.type = fmt_type.c_str()[0];
    
        ffs.number = determineOccurence(fmt_number);
        ffs.length = determineOccurence(fmt_length);
    
        return ffs;
    }
    
    //since the format string is CSV, split the string into tokens
    //and then, validate the tokens by attempting to match them
    //to the grammar (implemented as a simple regex). If the number of
    //validations match, everything went well: return true. Otherwise:
    //return false.
    bool FormatParser::validateFmtString() {    
        boost::char_separator<char> sep(",");
        bst_tokenizer tokens(mFmtStr, sep);
        mFmt = FFlag();
    
        size_t n_tokens = 0;
        std::string token;
    
        for(bst_tokenizer::const_iterator it = tokens.begin(); it != tokens.end(); ++it) {
            token = *it;
            boost::trim(token);
    
            //this "grammar" is based on the Fortran Format Flag Specification
            std::string rgx = "(\\d{0,8})([[:alpha:]])(\\d{0,8})";
            boost::regex re(rgx);
            boost::smatch matches;
    
            if (boost::regex_match(token, matches, re, boost::match_extra)) {
                mFmt = setFortranFmtArgs(matches);
                mFortranVars.push_back(mFmt);
            }
            ++n_tokens;
        }
    
        return mFortranVars.size() == n_tokens;
    }
    
    //Now, parse each input line from a file and try to parse and store
    //those variables into their associated containers.
    void FormatParser::parseAndStore(const std::string& line) {
        int offset = 0;
        int integer = 0;
        std::string varData;
        std::vector<int> intData;
        std::vector<std::string> charData;
    
        offset = 0;
    
        for (std::vector<FFlag>::const_iterator begin = mFortranVars.begin(); begin != mFortranVars.end(); ++begin) {
            mFmt = *begin;
    
            for (size_t i = 0; i < mFmt.number; offset += mFmt.length, ++i) {
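                //guard against lines shorter than the format expects:
                //substr throws std::out_of_range if offset > line.size()
                if (static_cast<size_t>(offset) > line.size()) {
                    break;
                }
                varData = line.substr(offset, mFmt.length);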
    
                //now store the data, based on type:
                switch(mFmt.type) {
                    case 'X':
                      break;
    
                    case 'A':
                      charData.push_back(varData);
                      break;
    
                    case 'I':
                      integer = atoi(varData.c_str());
                      intData.push_back(integer);
                      break;
    
                    default:
                      std::cerr << "Invalid type!\n";
                }
            }
        }
        mIntData.push_back(intData);
        mCharData.push_back(charData);
    }
    
    //Open the input file, and attempt to parse the input file line-by-line.
    void FormatParser::storeData() {
        mFmt = FFlag();
        std::ifstream ifile(mFilename.c_str(), std::ios::in);
        std::string line;
    
        if (ifile.is_open()) {
            while(std::getline(ifile, line)) {
                parseAndStore(line);
            }
        }
        else {
            std::cerr << "Error opening input file!\n";
            exit(3);
        }
    }
    
    //If character flags are set, this function will print the character data
    //found, line-by-line
    void FormatParser::printCharData() {    
        vvstr::const_iterator it = mCharData.begin();
        vstr::const_iterator jt;
        size_t linenum = 1;
    
        std::cout << "\nCHARACTER DATA:\n";
    
        for (; it != mCharData.end(); ++it) {
            std::cout << "LINE " << linenum << " : ";
            for (jt = it->begin(); jt != it->end(); ++jt) {
                std::cout << *jt << " ";
            }
            ++linenum;
            std::cout << "\n";
        }
    }
    
    //If integer flags are set, this function will print all the integer data
    //found, line-by-line
    void FormatParser::printIntData() {
        vvint::const_iterator it = mIntData.begin();
        vint::const_iterator jt;
        size_t linenum = 1;
    
        std::cout << "\nINT DATA:\n";
    
        for (; it != mIntData.end(); ++it) {
            std::cout << "LINE " << linenum << " : ";
            for (jt = it->begin(); jt != it->end(); ++jt) {
                std::cout << *jt << " ";
            }
            ++linenum;
            std::cout << "\n";
        }
    }
    
    //Attempt to parse the input file, by first validating the format string
    //and then, storing the data accordingly
    void FormatParser::parse() {
        if (!validateFmtString()) {
            std::cerr << "Error parsing the input format string!\n";
            exit(2);
        }
        else {
            storeData();
        }
    }
    
    int main(int argc, char **argv) {
        if (argc < 3 || argc > 3) {
            std::cerr << "Usage: " << argv[0] << "\t<Fortran Format Specifier(s)>\t<Filename>\n";
            exit(1);
        }
        else {
            //parse and print stuff here
            FormatParser parser(argv[1], argv[2]);
            parser.parse();
    
            //print the data parsed (if any)
            parser.printIntData();
            parser.printCharData();
        }
        return 0;
    }
    
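Bonus

This rudimentary parser also works with characters (Fortran format flag "A", for up to 8 characters). You can extend it to support whatever flags you like by editing the regex and checking the length of the captured strings in tandem with the type.

Possible improvements

If C++11 is available to you, you can use lambdas in some places and replace the explicit iterator types with auto.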
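To illustrate what a C++11 version of the format-validation step might look like, here is a minimal sketch. It swaps Boost.Regex for the standard std::regex, which is a substitution and not what the code above uses, and the names FormatItem and parseFormat are assumptions made for this example. The regex is the same grammar as above, with optional surrounding whitespace added so the tokens need no separate trim.

```cpp
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical counterpart of FFlag in the answer above.
struct FormatItem {
    char type;          // descriptor letter, e.g. 'I', 'X', 'A'
    std::size_t count;  // how many times the item repeats (default 1)
    std::size_t width;  // field width (default 1)
};

// Splits a comma-separated format string and matches each token against the
// same grammar as above. Returns an empty vector if any token fails to match.
std::vector<FormatItem> parseFormat(const std::string& fmt) {
    static const std::regex re("\\s*(\\d{0,8})([[:alpha:]])(\\d{0,8})\\s*");

    // A lambda replaces determineOccurence: an empty capture means 1.
    auto toCount = [](const std::string& s) -> std::size_t {
        return s.empty() ? 1 : std::stoul(s);
    };

    std::vector<FormatItem> items;
    std::istringstream in(fmt);
    std::string token;
    while (std::getline(in, token, ',')) {
        std::smatch m;
        if (!std::regex_match(token, m, re)) {
            return {};  // reject the whole format on the first bad token
        }
        items.push_back({m[2].str()[0], toCount(m[1].str()), toCount(m[3].str())});
    }
    return items;
}
```

Iterating over the result then becomes a range-based loop such as for (const auto& item : parseFormat("3I2, 3X, I3")), with no iterator typedefs needed.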

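If this runs in a limited memory space and you have to parse a large file, the vectors may eventually fail to allocate, because a vector keeps its elements in one contiguous block and reallocates that block as it grows. A deque, which allocates in chunks, is a better fit in that case. For more on this, see the discussion here: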

And, if the input file is large and file I/O is a bottleneck, you can improve performance by modifying the size of the ifstream buffer:
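A minimal sketch of that idea follows, using the standard pubsetbuf call on the stream's filebuf. The function name countLinesBuffered and the buffer size passed by the caller are assumptions made for illustration. Whether a larger buffer actually helps is implementation-defined, and on some standard libraries the call only takes effect if made before the file is opened, which is why the sketch opens the file afterwards.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads a file line by line through a caller-sized
// buffer and returns the number of lines read.
std::size_t countLinesBuffered(const std::string& path, std::size_t bufferSize) {
    // Declared before the stream so that it outlives the stream.
    std::vector<char> buffer(bufferSize);

    std::ifstream in;
    // Install the buffer first, then open the file.
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path.c_str());

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++lines;  // a real parser would call parseAndStore(line) here
    }
    return lines;
}
```

Measure before and after with your own data; the gain, if any, depends on the platform and the standard library.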

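Discussion

You will notice that the types you are parsing must be known at run time, and that every associated storage container has to be supported in the class declaration and definition.

As you would imagine, supporting all types in one main class is not efficient. However, since this is a naive solution, an improved, fuller solution can be specialized to support these cases.

Another suggestion is to use Boost.Spirit. But, because Spirit uses a lot of templates, debugging such an application is not for the faint of heart when errors can and do occur.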

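Performance

Compared to Jonathan Dursi's solution, this solution is slow.

For 10,000,000 lines of randomly generated output (a 124MiB file) using the same line format ("3I2,3X,I3"):

My solution:

    0m13.082s
    0m13.107s
    0m12.793s
    0m12.851s
    0m12.801s
    0m12.968s
    0m12.952s
    0m12.886s
    0m13.138s
    0m12.882s

Average wall time: 12.946s

Jonathan Dursi's solution:

    0m4.698s
    0m4.650s
    0m4.690s
    0m4.675s
    0m4.682s
    0m4.681s
    0m4.698s
    0m4.675s
    0m4.695s
    0m4.696s

Average wall time: 4.684s

His runs about 2.76 times as fast as mine, with both compiled at O2.

However, since you do not have to modify the source code every time you want to parse an additional format flag, this solution is preferable in that respect.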

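Note: you could also implement a solution built around an interface like the f77format line at the end of this answer.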
    
To compile the parser above (it links against Boost.Regex):

    g++ -Wall -std=c++98 -pedantic fortran_format_parser.cpp -lboost_regex
    
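The following program generates 10,000,000 random lines in the 3I2,3X,I3 layout, matching the test file described in the performance section:

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <fstream>
    using namespace std;
    
    int main(int argc, char **argv) {
        srand(time(NULL));
        if (argc != 2) {
            printf("Invalid usage! Use as follows:\t<Program>\t<Output Filename>\n");
            exit(1);
        }
    
        ofstream ofile(argv[1], ios::out);
        if (ofile.is_open()) {
            for (int i = 0; i < 10000000; ++i) {
                 //three 2-digit integers, a 3-character filler, then one 3-digit integer;
                 //'\n' instead of endl avoids flushing the stream on every line
                 ofile << (rand() % (99-10+1) + 10) << (rand() % (99-10+1) + 10) << (rand() % (99-10+1) + 10) << "---" << (rand() % (999-100+1) + 100) << '\n';
            }
        }
    
        ofile.close();
        return 0;
    }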
    
A stream-style interface of the kind mentioned in the note above would look like this:

    cin >> f77format("3I2, 3X, I3") >> a >> b >> c >> d;
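One way such an f77format interface might be built is sketched below. This is an illustration under stated assumptions, not an existing library: the f77format tag type and the F77Reader class are hypothetical names, only int targets are supported, and only the I and X descriptors are interpreted (any other descriptor is skipped over like X). Extracting f77format from a stream reads one line and returns a reader; each further >> pulls the next integer field from fixed columns.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tag type that carries the format text.
struct f77format {
    explicit f77format(const std::string& f) : text(f) {}
    std::string text;
};

// Hypothetical reader returned by `stream >> f77format(...)`.
class F77Reader {
public:
    F77Reader(std::istream& in, const std::string& fmt) : pos_(0), next_(0) {
        std::getline(in, line_);  // one record per line
        expand(fmt);
    }

    // Stores the next integer field, skipping non-integer fields before it.
    F77Reader& operator>>(int& value) {
        while (next_ < fields_.size()) {
            const Field f = fields_[next_++];
            const std::size_t start = pos_;
            pos_ += f.width;
            if (f.type == 'I') {
                value = start < line_.size()
                            ? std::atoi(line_.substr(start, f.width).c_str())
                            : 0;  // line shorter than the format: read as 0
                break;
            }
        }
        return *this;
    }

private:
    struct Field { char type; std::size_t width; };

    static bool digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    // Expands "3I2, 3X, I3" into I2 I2 I2 X1 X1 X1 I3.
    void expand(const std::string& fmt) {
        std::istringstream in(fmt);
        std::string tok;
        while (std::getline(in, tok, ',')) {
            std::size_t i = 0, repeat = 0, width = 0;
            bool hasRepeat = false, hasWidth = false;
            while (i < tok.size() && std::isspace(static_cast<unsigned char>(tok[i]))) ++i;
            while (i < tok.size() && digit(tok[i])) {
                repeat = repeat * 10 + static_cast<std::size_t>(tok[i++] - '0');
                hasRepeat = true;
            }
            if (i >= tok.size()) continue;  // no descriptor letter: ignore token
            const char type =
                static_cast<char>(std::toupper(static_cast<unsigned char>(tok[i++])));
            while (i < tok.size() && digit(tok[i])) {
                width = width * 10 + static_cast<std::size_t>(tok[i++] - '0');
                hasWidth = true;
            }
            if (!hasRepeat) repeat = 1;
            if (!hasWidth) width = 1;
            for (std::size_t r = 0; r < repeat; ++r) fields_.push_back(Field{type, width});
        }
    }

    std::string line_;
    std::vector<Field> fields_;
    std::size_t pos_;
    std::size_t next_;
};

// Entry point: reads one line from the stream and returns a reader for it.
inline F77Reader operator>>(std::istream& in, const f77format& f) {
    return F77Reader(in, f.text);
}
```

Because the reader slices by column with substr instead of relying on scanf's whitespace skipping, blank filler columns are handled the same way as non-blank ones.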