用C#解析html的最佳方法是什么?
我正在寻找一个库/方法来解析html文件,该库/方法比通用xml解析库具有更多特定于html的功能。您可以使用html DTD和通用xml解析库。解析html的问题在于它不是一门精确的科学。如果您正在解析的是XHTML,那么事情就会简单得多(正如您提到的,您可以使用通用XML解析器)。因为HTML不一定是格式良好的XML,所以在解析它时会遇到很多问题。这几乎需要逐个站点进行。您可以使用TidyNet.Tidy将HTML转换为XHTML,然后使用XML解析器 另一种选择是使用内置引擎mshtml:用C#解析html的最佳方法是什么?,c#,.net,html,parsing,html-content-extraction,C#,.net,Html,Parsing,Html Content Extraction,我正在寻找一个库/方法来解析html文件,该库/方法比通用xml解析库具有更多特定于html的功能。您可以使用html DTD和通用xml解析库。解析html的问题在于它不是一门精确的科学。如果您正在解析的是XHTML,那么事情就会简单得多(正如您提到的,您可以使用通用XML解析器)。因为HTML不一定是格式良好的XML,所以在解析它时会遇到很多问题。这几乎需要逐个站点进行。您可以使用TidyNet.Tidy将HTML转换为XHTML,然后使用XML解析器 另一种选择是使用内置引擎mshtml:
using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);
这允许您使用类似javascript的函数,如getElementById()我认为@Erlend使用
HTMLDocument
是最好的方法。然而,我也很幸运地使用了这个简单的库:
您可以在不疯狂使用第三方产品和mshtml(即互操作)的情况下做很多事情。使用System.Windows.Forms.WebBrowser。从那里,您可以在HtmlDocument上执行“GetElementById”或在HtmlElements上执行“GetElementsByTagName”等操作。如果您想实际与浏览器进行交互(例如模拟按钮单击),可以使用一点反射(比互操作的危害更小):
var wb = new WebBrowser()
。。。告诉浏览器导航(与此问题相切)。然后在Document_Completed事件上,您可以像这样模拟单击
var doc = wb.Browser.Document
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
你可以做类似的反思来提交表单等等
享受。我过去使用过加载随机网站,然后使用xpath(例如/html/body//p[@class='textblock'])攻击内容的各个部分。它工作得很好,但也有一些特殊的网站存在问题,所以我不知道这是否是绝对最好的解决方案。
这是一个敏捷的HTML解析器,它构建读/写DOM并支持纯XPATH或XSLT(您实际上不必理解XPATH或XSLT就可以使用它,不用担心……)。它是一个.NET代码库,允许您解析“web外”HTML文件。解析器对“真实世界”格式错误的HTML非常宽容。对象模型与System.Xml非常相似,但适用于HTML文档(或流)
Html敏捷包之前已经提到过——如果你想提高速度,你可能还想看看。它的处理相当笨拙,但它提供了非常快速的解析体验。我已经编写了一些代码,提供了“LINQ到HTML”的功能。我想我会在这里分享。它是根据雄伟的12。它接受Majestic-12结果并生成linqxml元素。此时,您可以针对HTML使用所有LINQ到XML工具。例如:
IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);
foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {
if (anchorTag.Attribute("href") == null)
continue;
Console.WriteLine(anchorTag.Attribute("href").Value);
}
IEnumerable auctionNodes=Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(bytearrayofactionHTML);
foreach(auctionNodes.OfType().DegenantsAndSelf(“a”)中的XElement anchorTag){
if(anchorTag.Attribute(“href”)==null)
继续;
Console.WriteLine(anchorTag.Attribute(“href”).Value);
}
我想使用Majestic-12,因为我知道它有很多关于HTML的内置知识,而HTML是在野外发现的。但我发现,要将Majestic-12结果映射到LINQ将接受为XML的内容,还需要额外的工作。我包含的代码进行了大量的清理,但是当您使用它时,您会发现被拒绝的页面。你需要修改代码来解决这个问题。引发异常时,请检查exception.Data[“source”],因为它很可能设置为导致异常的HTML标记。以一种好的方式处理HTML有时并不简单
现在期望值实际上很低,下面是代码:)
使用系统;
使用System.Collections.Generic;
使用System.Linq;
使用系统文本;
使用威严12;
使用System.IO;
使用System.Xml.Linq;
使用系统诊断;
使用System.Text.RegularExpressions;
命名空间Majestic12ToXml{
公共类Majestic12ToXml{
静态公共IEnumerable ConvertNodesToXml(字节[]htmlAsBytes){
HTMLparser parser=OpenParser();
Init(htmlAsBytes);
XElement currentNode=新XElement(“文档”);
HTMLchunk m12chunk=null;
int xmlnsAttributeIndex=0;
字符串originalHtml=“”;
而((m12cunk=parser.ParseNext())!=null){
试一试{
Debug.Assert(!m12cunk.bHashMode);//Majestic-12设置的常用默认值
XNode newNode=null;
XElement newNodesParent=null;
开关(m12cunk.oType){
案例HTMLchunkType.OpenTag:
//标记作为子项添加到当前标记中,
//除非新标记意味着
//一些祖先标记。
newNode=ParseTagNode(m12cunk,originalHtml,ref xmlnsAttributeIndex);
if(newNode!=null){
currentNode=FindParentOfNewNode(M12区块,原始HTML,currentNode);
newNodesParent=currentNode;
newNodesParent.Add(newNode);
currentNode=newNode作为XElement;
}
打破
案例HTMLchunkType.CloseTag:
if(m12cunk.bEndClosure){
newNode=ParseTagNode(m12cunk,originalHtml,ref xmlnsAttributeIndex);
if(newNode!=null){
currentNode=FindParentOfNewNode(M12区块,原始HTML,currentNode);
newNodesParent=currentNode;
newNodesParent.Add(newNode);
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Majestic12ToXml {
public class Majestic12ToXml {
static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {
HTMLparser parser = OpenParser();
parser.Init(htmlAsBytes);
XElement currentNode = new XElement("document");
HTMLchunk m12chunk = null;
int xmlnsAttributeIndex = 0;
string originalHtml = "";
while ((m12chunk = parser.ParseNext()) != null) {
try {
Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting
XNode newNode = null;
XElement newNodesParent = null;
switch (m12chunk.oType) {
case HTMLchunkType.OpenTag:
// Tags are added as a child to the current tag,
// except when the new tag implies the closure of
// some number of ancestor tags.
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
currentNode = newNode as XElement;
}
break;
case HTMLchunkType.CloseTag:
if (m12chunk.bEndClosure) {
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
}
}
else {
XElement nodeToClose = currentNode;
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
nodeToClose = nodeToClose.Parent;
if (nodeToClose != null)
currentNode = nodeToClose.Parent;
Debug.Assert(currentNode != null);
}
break;
case HTMLchunkType.Script:
newNode = new XElement("script", "REMOVED");
newNodesParent = currentNode;
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Comment:
newNodesParent = currentNode;
if (m12chunk.sTag == "!--")
newNode = new XComment(m12chunk.oHTML);
else if (m12chunk.sTag == "![CDATA[")
newNode = new XCData(m12chunk.oHTML);
else
throw new Exception("Unrecognized comment sTag");
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Text:
currentNode.Add(m12chunk.oHTML);
break;
default:
break;
}
}
catch (Exception e) {
var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
// the original html is copied for tracing/debugging purposes
originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
.Take(m12chunk.iChunkLength)
.Select(B => (char)B).ToArray());
wrappedE.Data.Add("source", originalHtml);
throw wrappedE;
}
}
while (currentNode.Parent != null)
currentNode = currentNode.Parent;
return currentNode.Nodes();
}
static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
XElement discoveredParent = null;
// Get a list of all ancestors
List<XElement> ancestors = new List<XElement>();
XElement ancestor = nextPotentialParent;
while (ancestor != null) {
ancestors.Add(ancestor);
ancestor = ancestor.Parent;
}
// Check if the new tag implies a previous tag was closed.
if ("form" == m12chunkCleanedTag) {
discoveredParent = ancestors
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("td" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "tr" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("tr" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => !("table" == XE.Name
|| "thead" == XE.Name
|| "tbody" == XE.Name
|| "tfoot" == XE.Name))
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("thead" == m12chunkCleanedTag
|| "tbody" == m12chunkCleanedTag
|| "tfoot" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "table" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
return discoveredParent ?? nextPotentialParent;
}
static string CleanupTagName(string originalName, string originalHtml) {
string tagName = originalName;
tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >
if (tagName.Contains(':'))
tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);
return tagName;
}
static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);
static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {
result = null;
string attributeName = originalName;
if (string.IsNullOrEmpty(originalName))
return false;
if (_startsAsNumeric.IsMatch(originalName))
return false;
//
// transform xmlns attributes so they don't actually create any XML namespaces
//
if (attributeName.ToLower().Equals("xmlns")) {
attributeName = "xmlns_" + xmlnsIndex.ToString(); ;
xmlnsIndex++;
}
else {
if (attributeName.ToLower().StartsWith("xmlns:")) {
attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
}
//
// trim trailing \"
//
attributeName = attributeName.TrimEnd(new char[] { '\"' });
attributeName = attributeName.Replace(":", "_");
}
result = attributeName;
return true;
}
static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"
static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {
if (string.IsNullOrEmpty(m12chunk.sTag)) {
if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
return new XElement("doctype");
if (_weirdTag.IsMatch(originalHtml))
return new XElement("REMOVED_weirdBlockParenthesisTag");
if (_aspnetPrecompiled.IsMatch(originalHtml))
return new XElement("REMOVED_ASPNET_PrecompiledDirective");
if (_shortHtmlComment.IsMatch(originalHtml))
return new XElement("REMOVED_ShortHtmlComment");
// Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
return null;
}
string tagName = CleanupTagName(m12chunk.sTag, originalHtml);
XElement result = new XElement(tagName);
List<XAttribute> attributes = new List<XAttribute>();
for (int i = 0; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "<!--") {
// an HTML comment was embedded within a tag. This comment and its contents
// will be interpreted as attributes by Majestic-12... skip this attributes
for (; i < m12chunk.iParams; i++) {
if (m12chunk.sTag == "--" || m12chunk.sTag == "-->")
break;
}
continue;
}
if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
continue;
string attributeName = m12chunk.sParams[i];
if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
continue;
attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
}
// If attributes are duplicated with different values, we complain.
// If attributes are duplicated with the same value, we remove all but 1.
var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);
foreach (var duplicatedAttribute in duplicatedAttributes) {
if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
throw new Exception("Attribute value was given different values");
attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
attributes.Add(duplicatedAttribute.First());
}
result.Add(attributes);
return result;
}
static HTMLparser OpenParser() {
HTMLparser oP = new HTMLparser();
// The code+comments in this function are from the Majestic-12 sample documentation.
// ...
// This is optional, but if you want high performance then you may
// want to set chunk hash mode to FALSE. This would result in tag params
// being added to string arrays in HTMLchunk object called sParams and sValues, with number
// of actual params being in iParams. See code below for details.
//
// When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
oP.SetChunkHashMode(false);
// if you set this to true then original parsed HTML for given chunk will be kept -
// this will reduce performance somewhat, but may be desireable in some cases where
// reconstruction of HTML may be necessary
oP.bKeepRawHTML = false;
// if set to true (it is false by default), then entities will be decoded: this is essential
// if you want to get strings that contain final representation of the data in HTML, however
// you should be aware that if you want to use such strings into output HTML string then you will
// need to do Entity encoding or same string may fail later
oP.bDecodeEntities = true;
// we have option to keep most entities as is - only replace stuff like
// this is called Mini Entities mode - it is handy when HTML will need
// to be re-created after it was parsed, though in this case really
// entities should not be parsed at all
oP.bDecodeMiniEntities = true;
if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
oP.InitMiniEntities();
// if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
// extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
// this only works if auto extraction is enabled
oP.bAutoExtractBetweenTagsOnly = true;
// if true then comments will be extracted automatically
oP.bAutoKeepComments = true;
// if true then scripts will be extracted automatically:
oP.bAutoKeepScripts = true;
// if this option is true then whitespace before start of tag will be compressed to single
// space character in string: " ", if false then full whitespace before tag will be returned (slower)
// you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
// a waste of CPU cycles
oP.bCompressWhiteSpaceBeforeTag = true;
// if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
// forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
// compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
// or open
oP.bAutoMarkClosedTagsWithParamsAsOpen = false;
return oP;
}
}
}
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;
class ParseHTML
{
public ParseHTML() { }
private string ReturnString;
public string doParsing(string html)
{
Thread t = new Thread(TParseMain);
t.ApartmentState = ApartmentState.STA;
t.Start((object)html);
t.Join();
return ReturnString;
}
private void TParseMain(object html)
{
WebBrowser wbc = new WebBrowser();
wbc.DocumentText = "feces of a dummy"; //;magic words
HtmlDocument doc = wbc.Document.OpenNew(true);
doc.Write((string)html);
this.ReturnString = doc.Body.InnerHtml + " do here something";
return;
}
}
string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);