C# Handling invalid XML hexadecimal characters

Tags: c#, xml, .net-3.5

I am trying to send an XML document over the wire, but I receive the following exception:

"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
   at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
   at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
   at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
   at System.Xml.XmlRawWriter.WriteValue(String value)
   at System.Xml.XmlWellFormedWriter.WriteValue(String value)
   at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
   --- End of inner exception stack trace ---
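I have no control over what I am trying to send, because the string is collected from an email. How can I encode the string so that it is valid XML while still keeping the illegal characters?

I would like to preserve the original characters one way or another.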

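// Base64-encode the text so that every character survives inside the XML.
// UTF-8 is used because ASCII encoding would replace non-ASCII characters with '?'.
byte[] toEncodeAsBytes = System.Text.Encoding.UTF8.GetBytes(toEncode);
string returnValue = System.Convert.ToBase64String(toEncodeAsBytes);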
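
If the receiving side is also under your control, the Base64 idea can be applied in both directions. The following is only a minimal sketch: the class name `Base64Text` and its two methods are hypothetical, and it assumes that both ends agree on UTF-8 as the text encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
public static class Base64Text
{
    // Encode text (including control characters such as U+0002) into
    // Base64, whose alphabet only uses characters that are legal in XML.
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    // Reverse of Encode: recover the original text on the receiving side.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

The receiver has to know that the element carries Base64 and call `Decode`; the payload also grows by roughly a third, so this fits best when the content does not need to be human-readable in the XML.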

The following code removes XML-invalid characters from a string and returns a new string without them:

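// Requires: using System.Text.RegularExpressions;
public static string CleanInvalidXmlChars(string text)
{
     // From the XML 1.0 spec, valid chars are:
     // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
     // .NET strings are UTF-16, so #x10000-#x10FFFF appear as surrogate pairs.
     // The pattern therefore matches control characters, FFFE/FFFF and unpaired surrogates.
     string re = @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]"
               + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
               + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
     return Regex.Replace(text, re, "");
}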

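I was on the receiving end of @parapurajkumar's solution: the illegal characters were loaded into XmlDocument correctly, but broke XmlWriter when I tried to save the output.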

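My context

I am using Elmah to view exception/error logs from a website. Elmah returns the state of the server at the time of the exception as a large XML document. For our reporting engine, I pretty-print the XML with XmlWriter.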

During a website attack, I noticed that some of the XML would not parse, and I received a "hexadecimal value 0x00, is an invalid character." exception.

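Non-resolution: I converted the document to a byte[] and tried to strip 0x00 from it, but no such byte was found.

When I scanned the XML document, I found the following: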

...
<form>
...
<item name="SomeField">
   <value
     string="C:\boot.ini&#x0;.htm" />
 </item>
...

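Lesson learned: if the incoming data is HTML-encoded on entry, sanitize for the illegal bytes using their associated HTML entity (here &#x0;) rather than only looking for the raw byte.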
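
As a rough illustration of that lesson, the sketch below removes numeric character references (such as `&#x0;`) whose code point XML 1.0 does not allow, and leaves legal references alone. The class `NumericEntityCleaner` and its method are hypothetical names chosen for this example, and the approach assumes the references appear literally in the text, as in the Elmah output above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Elmah or the .NET Framework.
public static class NumericEntityCleaner
{
    // Matches hexadecimal (&#x0;) and decimal (&#0;) character references.
    static readonly Regex Reference = new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // Character ranges allowed by the XML 1.0 specification.
    static bool IsAllowedCodePoint(long cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static string RemoveInvalidReferences(string xmlText)
    {
        return Reference.Replace(xmlText, m =>
        {
            long cp;
            bool parsed = m.Groups[1].Success
                ? long.TryParse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier,
                                CultureInfo.InvariantCulture, out cp)
                : long.TryParse(m.Groups[2].Value, NumberStyles.None,
                                CultureInfo.InvariantCulture, out cp);
            // Keep references to legal characters; drop everything else.
            return parsed && IsAllowedCodePoint(cp) ? m.Value : string.Empty;
        });
    }
}
```

Running this over the raw document text before handing it to XmlDocument or XmlWriter would turn `C:\boot.ini&#x0;.htm` into `C:\boot.ini.htm`. Whether silently dropping the reference is acceptable depends on how much of the attack payload you want to keep for forensics.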

Could the string not be cleaned with the following method?

System.Net.WebUtility.HtmlDecode()

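The following solution removes any invalid XML characters, but I believe it does so about as efficiently as possible. In particular, it does not allocate a new StringBuilder or a new string unless it has already determined that the string contains an invalid character. The hot path is therefore just a for loop over the characters, and the check on each character usually costs no more than two greater-than/less-than comparisons. If no invalid character is found, it simply returns the original string. This is especially useful when the vast majority of strings are fine to begin with, since it is best to pass them through as quickly as possible (with no wasted allocations and so on).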
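
A minimal sketch of the approach described above, written for illustration and not the answer author's original code. The class name `XmlScan` and the method name `StripInvalidXmlChars` are hypothetical; the character ranges are taken from the XML 1.0 specification, and valid surrogate pairs are kept because they encode #x10000-#x10FFFF in a .NET string.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class XmlScan
{
    // True for a single UTF-16 code unit that XML 1.0 allows on its own.
    static bool IsValidXmlChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // True when position i starts a well-formed surrogate pair.
    static bool IsPairAt(string s, int i)
    {
        return char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]);
    }

    public static string StripInvalidXmlChars(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;

        // Fast path: scan without allocating until the first invalid character.
        int i = 0;
        while (i < s.Length)
        {
            if (IsValidXmlChar(s[i])) i++;
            else if (IsPairAt(s, i)) i += 2;
            else break;
        }
        if (i == s.Length) return s; // nothing to remove: return the same instance

        // Slow path: copy the clean prefix, then filter the remainder.
        var sb = new StringBuilder(s.Length);
        sb.Append(s, 0, i);
        for (; i < s.Length; i++)
        {
            if (IsValidXmlChar(s[i])) sb.Append(s[i]);
            else if (IsPairAt(s, i)) { sb.Append(s[i]).Append(s[i + 1]); i++; }
            // anything else is an invalid character and is dropped
        }
        return sb.ToString();
    }
}
```

Because the clean case returns the input reference unchanged, callers can pass every string through the method without paying for an allocation on the common path.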

-- Update --

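See below for how you can also directly write an XElement that contains these invalid characters, although it uses this code. Some of this code was influenced by another answer; the same thread also has helpful information in a post by another author. All of those, however, still instantiate a new StringBuilder and a new string every time.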

Usage:

    string xmlStrBack = XML.ToValidXmlCharactersString("any string");
Test:

//--- Code --- (I have these methods in a static utility class named XML)

-- Fuller test --

This worked for me:

XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };
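
To show where such settings are used, here is a small sketch that writes a single element through `XmlWriter.Create` with character checking turned off. The class `LenientWriterDemo`, the method `WriteBody` and the element name `Body` are assumptions made for this example; it writes into a StringBuilder, so the `Encoding` setting is left out.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical demo class for illustration only.
public static class LenientWriterDemo
{
    // Writes <Body>text</Body> into a string without character checking.
    public static string WriteBody(string text)
    {
        var settings = new XmlWriterSettings
        {
            CheckCharacters = false,   // do not throw on characters such as U+0002
            OmitXmlDeclaration = true  // keep the output to the element itself
        };

        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("Body", text);
        }
        return sb.ToString();
    }
}
```

Keep in mind that the result is no longer well-formed XML 1.0, so a strict parser on the receiving side may reject it. As far as I know, an XmlReader created with `XmlReaderSettings.CheckCharacters = false` accepts character entity references to such characters, so this option mainly helps when both ends are .NET code you control.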

Another way to remove incorrect XML characters in C# is to use XmlConvert.IsXmlChar (available since .NET Framework 4.0).


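For example, the vertical tab character (\v) is not valid in XML: it is valid UTF-8 but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.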
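
When diagnosing input like this, it can help to see where the offending characters are before deciding whether to strip or encode them. The sketch below uses `XmlConvert.IsXmlChar` (so it needs .NET Framework 4.0 or later); the class `XmlCharProbe` and the method `FindInvalidPositions` are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical diagnostic helper for illustration only.
public static class XmlCharProbe
{
    // Returns the index of every UTF-16 code unit that XmlConvert.IsXmlChar rejects.
    public static List<int> FindInvalidPositions(string text)
    {
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (!XmlConvert.IsXmlChar(text[i]))
                positions.Add(i);
        }
        return positions;
    }
}
```

For the string `"a\vb\tc"` this reports only index 1: the vertical tab is rejected, while the horizontal tab is allowed. The check works on single code units, so, to my knowledge, the two halves of a surrogate pair are reported as well; `XmlConvert.IsXmlSurrogatePair` exists for testing those.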

There is a generic solution that works well:

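// Requires: using System; (for Func<string, string>)
// An XmlTextWriter that passes every piece of text through TextTransform before writing it.
public class XmlTextTransformWriter : System.Xml.XmlTextWriter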
{
    public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
    public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
    public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }

    public Func<string, string> TextTransform = s => s;

    public override void WriteString(string text)
    {
        base.WriteString(TextTransform(text));
    }

    public override void WriteCData(string text)
    {
        base.WriteCData(TextTransform(text));
    }

    public override void WriteComment(string text)
    {
        base.WriteComment(TextTransform(text));
    }

    public override void WriteRaw(string data)
    {
        base.WriteRaw(TextTransform(data));
    }

    public override void WriteValue(string value)
    {
        base.WriteValue(TextTransform(value));
    }
}

// Specialization of XmlTextTransformWriter that strips invalid XML characters from all text.
public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
    public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }

    void SetTransform()
    {
        TextTransform = XmlUtil.RemoveInvalidXmlChars;
    }
}

where XmlUtil.RemoveInvalidXmlChars is defined as follows:

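// Requires: using System.Linq;
public static class XmlUtil
{
    public static string RemoveInvalidXmlChars(string content)
    {
        // Only allocate a new string when at least one invalid character is present.
        // XmlConvert.IsXmlChar checks single UTF-16 code units (.NET Framework 4.0 or later).
        if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
            return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
        else
            return content;
    }
}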

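It depends on whether the illegal characters are something like x0, which XML cannot handle at all, or something like …
The regular expression as originally posted did not work properly because the final x10FFFF was not escaped. See this answer for a better regular expression.
Setting CheckCharacters=true in the settings did the job for me. Thanks.
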
    // Test for the ToStringIgnoreInvalidChars extension method.
    public static void TestXmlCleanser()
    {
        // The invalid character U+001E (invisible in most editors) sits between "Mont" and "oya".
        string badString = "My name is Inigo Mont\u001Eoya";

        XElement x = new XElement("test", badString);

        string xml1 = x.ToStringIgnoreInvalidChars();
        //result: <test>My name is Inigo Montoya</test>

        string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
        //result: <test>My name is Inigo Mont&#x1E;oya</test>
    }

    /// <summary>
    /// Writes this XML to string while allowing invalid XML chars to either be
    /// simply removed during the write process, or else encoded into entities, 
    /// instead of having an exception occur, as the standard XmlWriter.Create 
    /// XmlWriter does (which is the default writer used by XElement).
    /// </summary>
    /// <param name="xml">XElement.</param>
    /// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
    /// <param name="indent">Indent setting.</param>
    /// <param name="indentChar">Indent char (leave null to use default)</param>
    public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
    {
        if (xml == null) return null;

        StringWriter swriter = new StringWriter();
        using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {

            // -- settings --
            // unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
            writer.Formatting = indent ? Formatting.Indented : Formatting.None;

            if (indentChar != null)
                writer.IndentChar = (char)indentChar;

            // -- write --
            xml.WriteTo(writer); 
        }

        return swriter.ToString();
    }
public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
    public bool DeleteInvalidChars { get; set; }

    public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
    {
        DeleteInvalidChars = deleteInvalidChars;
    }

    public override void WriteString(string text)
    {
        if (text != null && DeleteInvalidChars)
            text = XML.ToValidXmlCharactersString(text);
        base.WriteString(text);
    }
}
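
// Simpler variant based on XmlConvert.IsXmlChar: always builds a new string.
// Requires: using System.Linq; and .NET Framework 4.0 or later.
public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}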