Parsing a chemical formula from a string in C#?


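I am trying to parse a chemical formula from a string in C# (in a format such as Al2O3, O3, C, or C11H22O12). It works fine unless a particular element has only one atom (for example, the oxygen atom in H2O). How can I fix this, and is there a better way to parse a chemical formula string than the one I am using?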

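ChemicalElement is a class representing a chemical element. It has the properties AtomicNumber (int), Name (string), and Symbol (string). ChemicalFormulaComponent is a class representing a chemical element together with an atom count (that is, one part of a formula). It has the properties Element (ChemicalElement) and AtomCount (int).

The rest should be clear enough to follow (I hope), but please let me know in a comment if there is anything I can clarify before you answer.

Here is my current code: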

    /// <summary>
    /// Parses a chemical formula from a string.
    /// </summary>
    /// <param name="chemicalFormula">The string to parse.</param>
    /// <exception cref="FormatException">The chemical formula was in an invalid format.</exception>
    public static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
    {
        Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();

        string nameBuffer = string.Empty;
        int countBuffer = 0;

        for (int i = 0; i < chemicalFormula.Length; i++)
        {
            char c = chemicalFormula[i];

            if (!char.IsLetterOrDigit(c) || !char.IsUpper(chemicalFormula, 0))
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
            else if (char.IsUpper(c))
            {
                // Add the chemical element and its atom count
                if (countBuffer > 0)
                {
                    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));

                    // Reset
                    nameBuffer = string.Empty;
                    countBuffer = 0;
                }

                nameBuffer += c;
            }
            else if (char.IsLower(c))
            {
                nameBuffer += c;
            }
            else if (char.IsDigit(c))
            {
                if (countBuffer == 0)
                {
                    countBuffer = c - '0';
                }
                else
                {
                    countBuffer = (countBuffer * 10) + (c - '0');
                }
            }
        }

        return formula;
    }


I rewrote your parser using regular expressions. Regular expressions fit what you are doing very well. Hope this helps.

public static void Main(string[] args)
{
    var testCases = new List<string>
    {
        "C11H22O12",
        "Al2O3",
        "O3",
        "C",
        "H2O"
    };

    foreach (string testCase in testCases)
    {
        Console.WriteLine("Testing {0}", testCase);

        var formula = FormulaFromString(testCase);

        foreach (var element in formula)
        {
            Console.WriteLine("{0} : {1}", element.Element, element.Count);
        }
        Console.WriteLine();
    }

    /* Produced the following output

    Testing C11H22O12
    C : 11
    H : 22
    O : 12

    Testing Al2O3
    Al : 2
    O : 3

    Testing O3
    O : 3

    Testing C
    C : 1

    Testing H2O
    H : 2
    O : 1
        */
}

private static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    string elementRegex = "([A-Z][a-z]*)([0-9]*)";
    string validateRegex = "^(" + elementRegex + ")+$";

    if (!Regex.IsMatch(chemicalFormula, validateRegex))
        throw new FormatException("Input string was in an incorrect format.");

    foreach (Match match in Regex.Matches(chemicalFormula, elementRegex))
    {
        string name = match.Groups[1].Value;

        int count =
            match.Groups[2].Value != "" ?
            int.Parse(match.Groups[2].Value) :
            1;

        formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(name), count));
    }

    return formula;
}
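The answer's code relies on the ChemicalElement and ChemicalFormulaComponent classes, which are not shown in the question. As a rough, self-contained sketch of the same regex technique, the version below returns plain symbol/count pairs instead. The class name RegexFormulaSketch, the named groups, and the use of \z (which, unlike $, does not tolerate a trailing newline) are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexFormulaSketch
{
    // One element: an uppercase letter, optional lowercase letters, optional count.
    private const string ElementPattern = "(?<symbol>[A-Z][a-z]*)(?<count>[0-9]*)";

    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        // Validate the whole string first so that stray characters are rejected.
        if (!Regex.IsMatch(chemicalFormula, "^(" + ElementPattern + @")+\z"))
            throw new FormatException("Input string was in an incorrect format.");

        var result = new List<KeyValuePair<string, int>>();
        foreach (Match match in Regex.Matches(chemicalFormula, ElementPattern))
        {
            string digits = match.Groups["count"].Value;
            // A missing count means exactly one atom.
            int count = digits.Length == 0 ? 1 : int.Parse(digits);
            result.Add(new KeyValuePair<string, int>(match.Groups["symbol"].Value, count));
        }
        return result;
    }
}
```

Calling RegexFormulaSketch.Parse("H2O") yields the pairs (H, 2) and (O, 1); mapping each symbol to a ChemicalElement would then be a separate lookup step.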


The problem with your approach is here:

            // Add the chemical element and its atom count
            if (countBuffer > 0)

When an element has no number, countBuffer will be 0. I think this would work:

            // Add the chemical element and its atom count
            if (countBuffer > 0 || nameBuffer != String.Empty)

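This will work for formulas such as HO2. I also believe your method never inserts the last element of the chemical formula into the formula collection.

Before returning the result, you should add the last element in the buffer to the collection, like this: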

    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));

    return formula;
}
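The two fixes above can be combined in one loop. The following is a minimal sketch, not the asker's actual class: it assumes plain symbol/count pairs in place of ChemicalFormulaComponent, and the names LoopFormulaSketch, sawDigit, and MakeComponent are hypothetical. It also defaults a missing count to 1, since flushing countBuffer unchanged would record an element such as the H in HO2 with a count of 0.

```csharp
using System;
using System.Collections.Generic;

public static class LoopFormulaSketch
{
    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        // The formula must start with an uppercase ASCII letter; check this once.
        if (string.IsNullOrEmpty(chemicalFormula)
            || chemicalFormula[0] < 'A' || chemicalFormula[0] > 'Z')
            throw new FormatException("Input string was in an incorrect format.");

        var formula = new List<KeyValuePair<string, int>>();
        string nameBuffer = string.Empty;
        int countBuffer = 0;
        bool sawDigit = false;

        foreach (char c in chemicalFormula)
        {
            if (c >= 'A' && c <= 'Z')
            {
                // A new symbol starts: flush the previous one, with or without a count.
                if (nameBuffer.Length > 0)
                    formula.Add(MakeComponent(nameBuffer, countBuffer, sawDigit));
                nameBuffer = c.ToString();
                countBuffer = 0;
                sawDigit = false;
            }
            else if (c >= 'a' && c <= 'z' && !sawDigit)
            {
                nameBuffer += c;
            }
            else if (c >= '0' && c <= '9')
            {
                countBuffer = (countBuffer * 10) + (c - '0');
                sawDigit = true;
            }
            else
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        // Flush the final element, which the loop alone never adds.
        formula.Add(MakeComponent(nameBuffer, countBuffer, sawDigit));
        return formula;
    }

    // An element written without digits counts as one atom.
    private static KeyValuePair<string, int> MakeComponent(string symbol, int count, bool hasCount)
    {
        return new KeyValuePair<string, int>(symbol, hasCount ? count : 1);
    }
}
```

With this sketch, H2O gives (H, 2), (O, 1) and C4O2 gives (C, 4), (O, 2), because the trailing element is flushed after the loop ends.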

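First: I have not used a parser generator in .NET, but I am fairly sure you can find a suitable one. That would let you write the grammar of chemical formulas in a much more readable form. See the linked example for a starting point.

If you want to keep your approach: is it possible that you never add the last element, whether or not it has a number? You may want to rework the loop and its countBuffer > 0 condition, because countBuffer can actually be zero.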

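Regular expressions should work fine for simple formulas. If you want to split something like this:

(Zn2(Ca(BrO4))K(Pb)2Rb)3

it may be easier to use a parser (because of the nested compounds). Any parser should be able to handle it.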
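To illustrate why a parser handles nesting more naturally, here is a small hand-written recursive-descent sketch. It is an assumption-laden illustration, not code from any answer: the class NestedFormulaSketch and its helpers are hypothetical, and instead of splitting the formula into components it flattens it into total atom counts per element symbol.

```csharp
using System;
using System.Collections.Generic;

public static class NestedFormulaSketch
{
    public static Dictionary<string, int> Parse(string formula)
    {
        int pos = 0;
        Dictionary<string, int> counts = ParseGroup(formula, ref pos);
        // Leftover input here means an unmatched ')' at the top level.
        if (pos != formula.Length || counts.Count == 0)
            throw new FormatException("Input string was in an incorrect format.");
        return counts;
    }

    // Parses atoms and parenthesized groups until ')' or the end of the input.
    private static Dictionary<string, int> ParseGroup(string s, ref int pos)
    {
        var counts = new Dictionary<string, int>();
        while (pos < s.Length && s[pos] != ')')
        {
            Dictionary<string, int> part;
            if (s[pos] == '(')
            {
                pos++; // consume '('
                part = ParseGroup(s, ref pos);
                if (pos >= s.Length || s[pos] != ')')
                    throw new FormatException("Missing ')'.");
                pos++; // consume ')'
            }
            else if (s[pos] >= 'A' && s[pos] <= 'Z')
            {
                int start = pos++;
                while (pos < s.Length && s[pos] >= 'a' && s[pos] <= 'z') pos++;
                part = new Dictionary<string, int> { { s.Substring(start, pos - start), 1 } };
            }
            else
            {
                throw new FormatException("Unexpected character '" + s[pos] + "'.");
            }

            // An optional count multiplies everything in the atom or group before it.
            int multiplier = ParseCount(s, ref pos);
            foreach (KeyValuePair<string, int> pair in part)
            {
                int existing;
                counts.TryGetValue(pair.Key, out existing); // leaves 0 when absent
                counts[pair.Key] = existing + pair.Value * multiplier;
            }
        }
        return counts;
    }

    // Reads an optional run of digits; a missing count means 1.
    private static int ParseCount(string s, ref int pos)
    {
        int start = pos;
        while (pos < s.Length && s[pos] >= '0' && s[pos] <= '9') pos++;
        return pos == start ? 1 : int.Parse(s.Substring(start, pos - start));
    }
}
```

For the formula above, this sketch gives Zn 6, Ca 3, Br 3, O 12, K 3, Pb 6, and Rb 3. Each recursive call handles one nesting level, which is something a single flat regular expression cannot express.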

I came across this question a few days ago and thought it would make a good example of writing a grammar for a parser, so I included a simple chemical formula grammar in my suite. The key rules are as follows. For the lexer:

    "(" -> LPAREN;
    ")" -> RPAREN;
    /[0-9]+/ -> NUM, Convert.ToInt32($text);
    /[A-Z][a-z]*/ -> ATOM;

For the parser:

    comp -> e:elem { e };

    elem -> LPAREN e:elem RPAREN n:NUM? { new Element(e,$(n : 1)) }
          | e:elem++ { new Element(e,1) }
          | a:ATOM n:NUM? { new Element(a,$(n : 1)) }
          ;

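Why do you check whether the first character of the formula is uppercase on every iteration of the loop (!char.IsUpper(chemicalFormula, 0))? The index there is always 0.

I think your function also has a problem with something like C4O2. Is that true?

See also the linked page. It asks for a solution in Java, has an answer in Python, and links to more complex ANTLR and Python solutions.

But side note: shouldn't it be *n?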