ASP.NET: How can I make XmlDocument work with XML that has unquoted attributes?

I have an ASP.NET VB project that needs to parse some raw XML coming from a database. The XML is laid out like this:

<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006&ndash;&nbsp;&nbsp;</A>; Postdoctoral Fellow, Toronto Western Hosp. 2000&ndash;06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. &amp; Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>

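I know the error is caused by the unquoted attributes. Is there any way to make XmlDocument understand the attributes before the string is loaded into it, or is there a simple way to add quotes to the attributes with a regular expression?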

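What you have is not valid XML, and XmlDocument requires valid XML as input. I suggest using an HTML parser, such as Html Agility Pack, to parse the HTML (which is what your input actually is). For example, if you want to list the name attribute value of every anchor, it is as simple as this: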

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.Load("test.html");
        foreach (var a in document.DocumentNode.Descendants("a"))
        {
            Console.WriteLine("Name: {0}", a.Attributes["name"].Value);
        }
    }
}

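I would write some logic to insert quotes around the attribute values. If the XML is not well-formed, you will get an error when loading the document.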
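
The quoting logic mentioned above could look roughly like the following. This is only a sketch: the class and method names (AttributeQuoter, QuoteAttributes) are made up for illustration, and it assumes attribute values contain no `>` or `&` characters. It also does nothing about HTML-only entities such as &ndash; and &nbsp; in the sample, which XML does not define, so XmlDocument would most likely still reject the full document until those are replaced too.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class AttributeQuoter
{
    // A start tag: "<", a letter, then anything up to the next ">".
    private static readonly Regex Tag = new Regex(@"<[A-Za-z][^<>]*>");

    // One attribute. Already-quoted values are matched (and consumed) by the
    // first two alternatives; only an unquoted value is captured in group 2.
    private static readonly Regex Attribute = new Regex(
        @"(\s[\w:-]+)=(?:""[^""]*""|'[^']*'|([^\s""'<>=]+))");

    public static string QuoteAttributes(string markup)
    {
        // Only text inside start tags is rewritten; element content is left alone.
        return Tag.Replace(markup, new MatchEvaluator(FixTag));
    }

    private static string FixTag(Match tag)
    {
        return Attribute.Replace(tag.Value, new MatchEvaluator(FixAttribute));
    }

    private static string FixAttribute(Match attr)
    {
        // Group 2 is empty when the value was already quoted: keep it unchanged.
        if (!attr.Groups[2].Success) return attr.Value;
        return attr.Groups[1].Value + "=\"" + attr.Groups[2].Value + "\"";
    }
}

public static class Demo
{
    public static void Main()
    {
        // Shortened fragment of the markup from the question (no HTML entities).
        string html = "<BODY><A name=SN>AARTS</A>, <A name=OF1_PC>M1C 1A4</A></BODY>";
        string xml = AttributeQuoter.QuoteAttributes(html);

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // XPath is case-sensitive, so the upper-case element name is used.
        foreach (XmlNode a in doc.SelectNodes("//A"))
        {
            Console.WriteLine(a.Attributes["name"].Value + ": " + a.InnerText);
        }
    }
}
```

Matching quoted values explicitly, rather than only looking for unquoted ones, keeps text such as b=c inside an already-quoted value from being rewritten. For anything beyond simple, predictable markup, a real HTML parser is the safer choice.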

You can use the Html2Xhtml library for this. Here is a link:

You should be able to use the library to put the content into an XDocument, like this:

string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";

var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);

Console.WriteLine(xdoc);
I believe Html2Xhtml supports .NET Framework 2.0 and later. If it doesn't, I'm fairly sure one of the earlier versions does; failing that, you can use:

That article uses HTML Tidy, and the source code in it should work on 2.0.

You can also try SgmlReader, which is well suited to this kind of problem:

using (var strReader = new StringReader(html))
{
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = strReader;

        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
    }
}

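Yes, that is exactly what I asked for in the question: logic to add quotes to the attributes. If you know how to do that, that's what I'm after…

@tsdexter: see the edited answer. The Html2Xhtml library will do all the hard work for you and put the content into an XDocument.

That document is not a valid XML document. Attributes must be quoted for it to be valid XML. This is HTML, not XML.

Yes, I realize it is invalid, which is why I'm asking for suggestions on how to make it valid.

I'm trying to use it, but I need to load the HTML from a string rather than from a file. How can I do that?

@tsdexter, by using the document.LoadHtml method instead of document.Load.

Thanks. Is there no documentation in the download? And IntelliSense isn't suggesting the available methods.

@tsdexter, when I type document.LoadHtml( in VS, IntelliSense shows me: "Loads the HTML document from the specified string."

Hmm, maybe that's because I'm using .NET 2.0 in VS 2005. Anyway, this answer works for me. Thanks a lot!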