C#: Using LINQ to extract HTML values into a list

c# html list linq web

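I'm an absolute beginner with C#, and especially with HtmlAgilityPack and LINQ, but I'm trying to put together a LINQ statement that retrieves specific values from specific fields in an HTML page. This is the LINQ statement I'm using: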

List<testClass> Results = (from div in doc1.DocumentNode.Descendants("div")                                      
                                   from c in div.Descendants("class")
                                   select new testClass(
                                              c.Attributes["hdp-fact-ataglance-heading"].Value,
                                              c.Attributes["hdp-fact-ataglance-value"].Value
                                              )).ToList();

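It isn't working properly and I don't know why. I expect it to put the values into a list that I can print to the screen. Unfortunately, I get an empty list with 0 items. The LINQ statement looks correct in the debugger and I get no errors. The main thing I want is to avoid using any loops, for performance reasons. I think I'm either not selecting the right nodes or not composing the LINQ correctly. Here is a snippet of the HTML: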

<div class="hdp-fact-container">
    <div class="hdp-fact-category">Bedrooms</div>
    <ul class="zsg-sm-1-1 hdp-fact-list" id="yui_3_18_1_2_1499723568429_1169">
        <li class="" id="yui_3_18_1_2_1499723568429_1168">
            <span class="hdp-fact-name">Beds: </span>
            <span class="hdp-fact-value" id="yui_3_18_1_2_1499723568429_1167">4</span>
        </li>
    </ul>
</div>
<div class="hdp-fact-container" id="yui_3_18_1_2_1499723568429_2392">
    <div class="hdp-fact-category">Heating and Cooling</div>
    <ul class="zsg-sm-1-1 hdp-fact-list" id="yui_3_18_1_2_1499723568429_2391">
        <li class="">
            <span class="hdp-fact-name">Heating: </span>
            <span class="hdp-fact-value">Forced air</span>
        </li>
        <li class="" id="yui_3_18_1_2_1499723568429_2390">
            <span class="hdp-fact-name">Cooling: </span>
            <span class="hdp-fact-value">Central</span>
        </li>
    </ul>
</div>
<div class="hdp-fact-container">
    <div class="hdp-fact-category">Basement</div>
    <ul class="zsg-sm-1-1 hdp-fact-list">
        <li class="">
            <span class="hdp-fact-value">Unfinished basement</span>
        </li>
    </ul>
</div>

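Ultimately I'm trying to get console output like this, using an ordinary loop and Console:

Beds: 4

Heating: Forced air

and so on. I only want that particular loop to check the contents of the list; it isn't permanent.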

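So far I've been using the following to extract my data... it takes a very long time, to say the least. That's why I'm trying LINQ, since I thought it might be faster? I don't know what the best approach is; any suggestions would be appreciated.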

public string searchSCH(string content, string starttag, string endtag, int count)
    {
        string contentsub;
        int location1, location2;
        location1 = location2 = 0;
        if (content.Contains(starttag))
        {
            do
            {
                location1 = content.IndexOf(starttag, location1 + 1);
                if (location1 == -1)
                    return null;

                count--;
            } while (count > 0);

            location2 = content.IndexOf(endtag, location1 + 1);
            if (location2 == -1)
                return null;

            location1 += starttag.Length;
            contentsub = content.Substring(location1, location2 - location1);

            contentsub = Regex.Replace(contentsub, @"<[^>]+>|&nbsp;", string.Empty).Trim();
            contentsub = Regex.Replace(contentsub, "\".*>", string.Empty).Trim();
            contentsub = Regex.Replace(contentsub, "  ", "%");
            contentsub = Regex.Replace(contentsub, "\n", string.Empty);
            contentsub = Regex.Replace(contentsub, "\r", string.Empty);
            contentsub = Regex.Replace(contentsub, "\">", string.Empty);
            contentsub = Regex.Replace(contentsub, "\"%.*>", string.Empty);
            contentsub = Regex.Replace(contentsub, @"%+", "|");
            return contentsub;
        }
        else
        {
            return "fail";
        }

    }

Break your logic into smaller pieces:

// This should give you a list of all the containers.
// Note: SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = doc1.DocumentNode.SelectNodes("//div[contains(@class, 'hdp-fact-container')]");

// Then loop through each container to grab the category ("hdp-fact-category")
// and the facts list ("hdp-fact-list").
// The leading "." keeps each query relative to the current container;
// a bare "//" would search the whole document again.
foreach (var item in nodes)
{
    var categoryNode = item.SelectSingleNode(".//div[contains(@class, 'hdp-fact-category')]");
    var factsList = item.SelectNodes(".//ul[contains(@class, 'hdp-fact-list')]/li");

    // Then do whatever with those nodes
}

My answer is based on the assumption that you'll want the category as well. If not, leave that expression out.
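A likely reason the query in the question returns an empty list: `Descendants("class")` matches elements whose tag name is `class`, and `Attributes["hdp-fact-ataglance-heading"]` looks up an attribute with that name. In the HTML shown, `class` is an attribute and the `hdp-fact-...` strings are its values. The following is a minimal LINQ sketch under that reading, assuming the HtmlAgilityPack package is referenced. `FactExtractor`, `HasClass`, and `Extract` are hypothetical names introduced for illustration, not part of the library or of the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FactExtractor
{
    // True when the node's class attribute contains the given class token.
    private static bool HasClass(HtmlNode node, string cssClass) =>
        node.GetAttributeValue("class", "")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Contains(cssClass);

    // Returns one (name, value) pair per <li> that has an "hdp-fact-value" span.
    // The name is empty when the <li> has no "hdp-fact-name" span.
    public static List<KeyValuePair<string, string>> Extract(HtmlDocument doc)
    {
        return (from li in doc.DocumentNode.Descendants("li")
                let value = li.Elements("span")
                              .FirstOrDefault(s => HasClass(s, "hdp-fact-value"))
                where value != null
                let name = li.Elements("span")
                             .FirstOrDefault(s => HasClass(s, "hdp-fact-name"))
                select new KeyValuePair<string, string>(
                    name == null ? "" : name.InnerText.Trim(),
                    value.InnerText.Trim())).ToList();
    }
}
```

To use it, load the page with `new HtmlDocument()` and `LoadHtml(...)`, call `FactExtractor.Extract(doc)`, and print each pair's `Key` and `Value`. The query still iterates over the nodes internally, so it is not expected to be faster than an explicit `foreach`; it only selects by tag name and class value instead of treating `class` as a tag.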

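You can't just say "it isn't working properly" and expect us to know how to fix it... tell us how it isn't working... Do you get an error? Not the expected result? If not, what result do you get?

Sorry, there are no errors in debugging. I'm not getting any values in the list.

Not sure you can filter descendants by using class, which is not a tag but an attribute here: div.Descendants("class")

Do you know of a way to avoid using loops to extract the data I need? I'm processing about 1.4 million HTML files.

@Wes Your original solution used LINQ; what do you think some of those operators do internally? Have you tried my solution to see how it handles your data? There isn't much optimization you can do when parsing HTML/XML; the size of the document determines your options.

I'm sorry, I wasn't trying to imply that LINQ doesn't loop. I just wanted to know whether you knew of a way to extract the required data that doesn't involve loops, since you know more about them than I do. Unfortunately, I couldn't get the code above to work.

You need to process multiple items, which is exactly what loops are designed for. What's your concern about using loops? Also, what in my code isn't working?

I get this error: System.Xml.XPath.XPathException: '//div[contains(@class, 'hdp-fact-list']' has an invalid token. I'm not familiar with XPath; are there any restrictions I should be aware of when using it?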
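The "invalid token" error in the last comment is the kind of exception an XPath parser raises for a malformed expression, for example a `contains(` call that is never closed with `)`. As a rough sketch, an expression can be checked with the standard `System.Xml.XPath` classes before it is handed to `SelectNodes`. `XPathCheck` and `IsValid` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Xml.XPath;

public static class XPathCheck
{
    // Returns true when the expression parses, false when the parser rejects it.
    public static bool IsValid(string xpath)
    {
        try
        {
            // Compile parses the expression eagerly and throws on syntax errors.
            XPathExpression.Compile(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }
}
```

Under this sketch, `IsValid("//div[contains(@class, 'hdp-fact-category']")` reports the missing parenthesis, while `IsValid("//div[contains(@class, 'hdp-fact-category')]")` passes. A check like this catches only syntax errors; it does not tell you whether the expression matches any nodes.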