How to write a regular expression to find HTML tags outside CDATA tags in an XML document


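I'm trying to import an ONIX (XML) file that fails with import errors because of HTML tags in the descriptive text. In this particular file, some of the descriptive text is wrapped in CDATA tags, but some apparently is not.

How can I write a regular expression that finds HTML tags that are not enclosed in CDATA tags?

I'm using a VB.NET application to import the data into a SQL Server database, but for now I'm trying to write the regular expression in Notepad++ to see what is possible. I can incorporate the regex into the VB code later.

Here is an example of some XML that imports correctly: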

<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text><![CDATA[More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.]]></Text>
</OtherText>
<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text>More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.</Text>
</OtherText>

Here is XML that does not import correctly:

<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text><![CDATA[More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.]]></Text>
</OtherText>
<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text>More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.</Text>
</OtherText>

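Now, <TextFormat>02</TextFormat> indicates that the content of the tag is HTML, so I can handle that. The problem arises when I have tags that are not labelled correctly. I need to find these so that I can correct them.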

This regex can help you achieve that:

<\w+>(?!<!\[CDATA\[)

I ran it in Sublime Text on the examples provided, and it matched only the HTML tags that were not followed by CDATA content.
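As a minimal sketch of how the negative lookahead behaves, the pattern can be tried with Python's `re` module, with the square brackets escaped so that they are read literally. The shortened sample string and the variable names are illustrative assumptions, not taken from the original file. Note that the lookahead only inspects what immediately follows a tag, so the pattern also reports an `<em>` that sits inside a CDATA section. Narrowing `\w+` to a single tag name such as `Text` (an assumption about which element matters) isolates the unwrapped elements.

```python
import re

# Any opening tag that is NOT immediately followed by a CDATA marker.
# The square brackets are escaped so they are matched literally.
TAG_NOT_BEFORE_CDATA = re.compile(r"<\w+>(?!<!\[CDATA\[)")

# Narrower variant (assumption): only <Text> elements whose content
# does not start with a CDATA section.
TEXT_NOT_BEFORE_CDATA = re.compile(r"<Text>(?!<!\[CDATA\[)")

# Hypothetical, shortened sample modelled on the question's XML.
sample = (
    "<Text><![CDATA[Wrapped <em>title</em> text]]></Text>"
    "<Text>Bare <em>title</em> text</Text>"
)

all_tags = [m.group(0) for m in TAG_NOT_BEFORE_CDATA.finditer(sample)]
bare_text_starts = [m.start() for m in TEXT_NOT_BEFORE_CDATA.finditer(sample)]

print(all_tags)
for start in bare_text_starts:
    # Show a little context after each unwrapped <Text> tag.
    print(sample[start:start + 20])
```

The first list includes the `<em>` inside the CDATA section as well as the one outside it, which is why the narrower `<Text>` variant is often the more useful search when the goal is to locate unwrapped elements.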

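Regex is probably not the right tool for this job. What you need is an XML/HTML parser that is forgiving of this format (it's not clear to me whether it is malformed or just fails to validate against the schema) and lets you re-encode the appropriate sections.

Please post an example of the XML line that you intend to match the regex against.

Basically that is what I'm writing. The ONIX standard allows HTML in certain fields, so I look for those fields based on their attributes and insert the CDATA tags. Usually, once that is done, I can import the XML into a dataset and everything works fine, but occasionally I come across a file where the tags don't conform to the standard, so I get import errors.

Don't conform to what standard? Are you saying you have some existing code? If so, please include it in the question, with a clear example of where it does not work. People are more likely to help you improve...

My advice is to abandon this line of thinking, because regular expressions are generally not suited to handling the recursive structure of XML.

To do this I had to escape the square brackets to get it to work in Notepad++, as follows:
(?!

If I understand your question correctly, you can replace \w+ with em.

Thanks - I should have worked that out myself!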
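One possible way to approach the original question, sketched under stated assumptions: blank out every CDATA section first, then search what remains for a known set of HTML tag names. The tag list, the function name `html_tags_outside_cdata`, and the sample text are all hypothetical. Python is used only for illustration, and this is a rough aid for locating problem spots, not a substitute for a real XML parser.

```python
import re

# Matches a whole CDATA section, non-greedily, across line breaks.
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

# Hypothetical list of HTML tag names to look for; extend as needed.
HTML_TAG = re.compile(
    r"</?(?:em|br|p|b|i|strong|ul|ol|li)\b[^>]*>", re.IGNORECASE
)


def html_tags_outside_cdata(xml_text):
    """Return (offset, tag) pairs for HTML tags not inside a CDATA section."""
    # Replace each CDATA section with spaces of the same length so that
    # offsets in the masked text still line up with the original text.
    masked = CDATA_SECTION.sub(lambda m: " " * len(m.group(0)), xml_text)
    return [(m.start(), m.group(0)) for m in HTML_TAG.finditer(masked)]


# Hypothetical, shortened sample modelled on the question's XML.
sample = (
    "<Text><![CDATA[Wrapped <em>title</em><br/>]]></Text>\n"
    "<Text>Bare <em>title</em><br/></Text>"
)

for offset, tag in html_tags_outside_cdata(sample):
    print(offset, tag)
```

Masking with same-length padding, instead of deleting the CDATA sections, keeps each reported offset valid in the original file, which makes it easier to jump to the offending tag in an editor.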