C#: Scraping with HtmlAgilityPack

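I have a huge HTML page and I want to scrape some values from it.

I tried using Firebug to get the XPath of the element I want, but it is not a static XPath because it changes from time to time. How can I get the value I want?

In the snippet below I want to get the lumber production per hour, which is 20 here: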

    <div class="boxes-contents cf"><table id="production" cellpadding="1" cellspacing="1">
    <thead>
        <tr>
            <th colspan="4">
                Production per hour:            </th>
        </tr>
    </thead>
    <tbody>
                <tr>
            <td class="ico">
                <img class="r1" src="img/x.gif" alt="Lumber" title="Lumber" />
            </td>
            <td class="res">
                Lumber:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r2" src="img/x.gif" alt="Clay" title="Clay" />
            </td>
            <td class="res">
                Clay:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r3" src="img/x.gif" alt="Iron" title="Iron" />
            </td>
            <td class="res">
                Iron:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r4" src="img/x.gif" alt="Crop" title="Crop" />
            </td>
            <td class="res">
                Crop:
            </td>
            <td class="num">
                59          </td>
        </tr>
            </tbody>
</table>
    </div>


With Html Agility Pack you would do something like the following:

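    // Requires: using System.IO; using System.Linq; using System.Net;
    // `url` is the address of the page to scrape.
    var webclient = new WebClient();
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    // Download the raw page and hand its text to Html Agility Pack.
    byte[] htmlBytes = webclient.DownloadData(url);
    using (var htmlMemStream = new MemoryStream(htmlBytes))
    using (var htmlStreamReader = new StreamReader(htmlMemStream))
    {
        htmlDoc.LoadHtml(htmlStreamReader.ReadToEnd());
    }

    // First <table> in the document (assumed to be the production table).
    var table = htmlDoc.DocumentNode.Descendants("table").FirstOrDefault();

    // First <td class="num"> in that table, which is the Lumber row in the snippet above.
    var lumberTd = table.Descendants("td").FirstOrDefault(node => node.Attributes["class"] != null && node.Attributes["class"].Value == "num");

    string lumberValue = lumberTd.InnerText.Trim();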

Warning: `FirstOrDefault()` can return null, so you should probably add some checks there.

Hope that helps.
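The null checks mentioned in the warning could look roughly like this. It is a minimal sketch, not part of the original answer: the class and method names (`LumberReader`, `ReadFirstNum`) are made up for illustration, and the HTML is passed in as a string so the download step is left out.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LumberReader
{
    // Hypothetical helper: returns the trimmed text of the first
    // <td class="num"> inside the first <table>, or null when either
    // element is missing.
    public static string ReadFirstNum(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.Descendants("table").FirstOrDefault();
        if (table == null)
        {
            return null; // no table on the page at all
        }

        // GetAttributeValue returns the supplied default ("") when the
        // attribute is absent, so no separate null test is needed.
        var numTd = table.Descendants("td")
            .FirstOrDefault(td => td.GetAttributeValue("class", "") == "num");

        // ?. yields null instead of throwing NullReferenceException.
        return numTd?.InnerText.Trim();
    }
}
```

The caller can then tell "element not found" (null) apart from a real value and decide how to report it.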

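Alternatively, select the table rows with XPath and pick the one whose image is titled "Lumber":

    // Requires: using System.Linq;
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(fileName); // fileName: path to a saved copy of the page

    // Rows of the production table inside the "boxes-contents cf" div.
    var result = doc.DocumentNode.SelectNodes("//div[@class='boxes-contents cf']//tbody/tr")
                    // Keep the row whose icon is titled "Lumber".
                    .First(tr => tr.Element("td").Element("img").Attributes["title"].Value == "Lumber")
                    .Elements("td")
                    // Its <td class="num"> holds the hourly production.
                    .First(td => td.Attributes["class"].Value == "num")
                    .InnerText
                    .Trim();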
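The same row-by-row idea can be extended to read every resource at once. The sketch below is an illustration rather than part of the original answer: `ProductionTable` and `ReadAll` are hypothetical names, and it assumes the table keeps its `id="production"` attribute. One detail worth knowing is that `SelectNodes` returns null, not an empty collection, when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ProductionTable
{
    // Hypothetical helper: maps each resource name (the <img> title)
    // to its hourly production (the number in <td class="num">).
    public static Dictionary<string, int> ReadAll(string html)
    {
        var result = new Dictionary<string, int>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only rows that actually contain an icon cell are selected.
        var rows = doc.DocumentNode.SelectNodes("//table[@id='production']//tr[td/img]");
        if (rows == null)
        {
            return result; // SelectNodes gives null when there is no match
        }

        foreach (HtmlNode row in rows)
        {
            // Non-null here because the XPath above required td/img.
            string name = row.SelectSingleNode("td/img").GetAttributeValue("title", "");
            HtmlNode numTd = row.SelectSingleNode("td[@class='num']");

            // Skip rows whose number cell is missing or not an integer.
            if (numTd != null && int.TryParse(numTd.InnerText.Trim(), out int value))
            {
                result[name] = value;
            }
        }
        return result;
    }
}
```

With the HTML from the question, the dictionary would hold the four entries Lumber, Clay, Iron and Crop.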


Can you post some code showing what you have tried?

It's a hell of a noobish attempt with a "no object reference" error. I'll post it anyway, although it's useless.

First, thank you very much for your help. But is there a more direct way to get the values I want from a web page, especially ones that are not marked with a specific ID?

From the sample HTML you provided, I can't see any unique IDs that would make extracting the data you want easier.

No, you didn't get my point. When an element has an ID, I can easily get it with GetElementById. So is there a way as direct as GetElementById to get a specific node without nesting a lot of code?

Yes, you can do:

    htmlDoc.DocumentNode.Descendants().Where(node => node.Attributes["ID"] != null && node.Attributes["ID"].Value == "myid").FirstOrDefault();

That is a really nice piece of code, but I want to know how to work out the code that gets the element I want. Don't give me a fish, teach me how to catch one :D, as I recall the saying :P
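One way to approach the "how do I work it out myself" question is to ignore the positional XPath that Firebug generates and anchor on attributes that stay fixed in the markup, here the table's `id="production"` and the image's `title`. The sketch below is only an illustration under that assumption. `ProductionReader` and `ReadProduction` are hypothetical names, while `GetElementbyId` (with a lowercase "b") and `SelectSingleNode` are Html Agility Pack methods.

```csharp
using HtmlAgilityPack;

public static class ProductionReader
{
    // Hypothetical helper: returns the hourly production text for one
    // resource (for example "Lumber"), or null if no matching row exists.
    public static string ReadProduction(string html, string resource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Closest equivalent of the browser's getElementById.
        HtmlNode table = doc.GetElementbyId("production");
        if (table == null)
        {
            return null;
        }

        // Relative XPath: the row containing <img title="resource">,
        // then that row's <td class="num">.
        // `resource` is spliced in unescaped, so it must not contain a quote.
        HtmlNode numTd = table.SelectSingleNode(
            ".//tr[td/img[@title='" + resource + "']]/td[@class='num']");

        return numTd?.InnerText.Trim();
    }
}
```

Because the expression depends on what the row contains rather than where it sits, it should keep working if rows are reordered or extra wrapper elements appear around the table.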
