C# HtmlAgilityPack and dynamic content issue

Tags: c#, html-agility-pack, dynamic-content

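I want to create a web scraper application, and I want to build it with the WebBrowser control, HtmlAgilityPack and XPath.

So far I have managed to create an XPath generator (I used the WebBrowser control for this), and it works fine, but sometimes I cannot grab content that is generated dynamically (via JavaScript or AJAX). I also found that the WebBrowser control (which is really the IE browser) generates some extra tags such as "tbody", while HtmlAgilityPack, loaded with `htmlWeb.Load(webBrowser.DocumentStream);`, does not see them.

Another note: I found that the following code actually gets the current source of the web page, but I could not feed it to HtmlAgilityPack: `(mshtml.IHTMLDocument3)webBrowser.Document.DomDocument`


Can you help me, please?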

Use the following method of the HTML Agility Pack document:

htmlAgilityPackDocument.LoadHtml(this.browser.DocumentText);


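I just spent hours trying to get HtmlAgilityPack to pick up some AJAX-generated dynamic content from a web page, going from one useless post to another until I found this one.

The answer is hidden in a comment under the original post, so I thought I should spell it out.

This is the approach I used at first, which did not work: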

// Requires: using System.IO; using System.Net; using System.Text;
private void LoadTraditionalWay(String url)
{
    // Downloads the raw response only; no JavaScript runs here,
    // so AJAX-generated content never appears in the document.
    WebRequest myWebRequest = WebRequest.Create(url);
    using (WebResponse myWebResponse = myWebRequest.GetResponse())
    using (Stream receiveStream = myWebResponse.GetResponseStream())
    using (TextReader reader = new StreamReader(receiveStream, Encoding.UTF8))
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(reader);
    }
}
WebRequest does not render the page or execute the AJAX queries that produce the missing content.

This is the solution that worked:

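// Requires references to System.Windows.Forms, HtmlAgilityPack and Microsoft.mshtml.
private void LoadHtmlWithBrowser(String url)
{
    webBrowser1.ScriptErrorsSuppressed = true;
    webBrowser1.Navigate(url);

    // Pump messages until the control reports the page as fully loaded.
    waitTillLoad(this.webBrowser1);

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    // Read the live DOM (after scripts have run), not the original page source.
    var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
    StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
    doc.Load(sr);
}

private void waitTillLoad(WebBrowser webBrControl)
{
    WebBrowserReadyState loadStatus;
    int waittime = 100000;
    int counter = 0;

    // Phase 1: wait until navigation has actually started (or give up
    // after `waittime` iterations), so a stale Complete state is not mistaken
    // for the new page.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if ((counter > waittime) || (loadStatus == WebBrowserReadyState.Uninitialized) || (loadStatus == WebBrowserReadyState.Loading) || (loadStatus == WebBrowserReadyState.Interactive))
        {
            break;
        }
        counter++;
    }

    // Phase 2: wait until the page is complete and the control is idle.
    // Note that this loop has no timeout; the counter is not checked here.
    counter = 0;
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if (loadStatus == WebBrowserReadyState.Complete && webBrControl.IsBusy != true)
        {
            break;
        }
        counter++;
    }
}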
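The idea is to load the page with the WebBrowser control, which is able to render the AJAX content, wait until the page has fully rendered, and then use the Microsoft.mshtml library to feed the resulting HTML back into the Agility Pack.

This was the only way I could get access to the dynamic data.


I hope it helps someone doing the same thing. As far as I understand, it creates an instance of the browser engine, more or less, which should allow JS to execute and let you get the result of the manipulated DOM.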

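Help with what? What is your specific question? You have to show some code to get real help.

Sorry guys, I found the solution here: var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser.Document.DomDocument; StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML); htmlDoc.Load(sr); That worked.

@user1322188: How were you able to retrieve the dynamic content of the page? Was HtmlAgilityPack used to retrieve the dynamic content?

Good job, Nick! Thanks for posting your solution. It was very useful to me!

How annoying! When adding the reference, note that MSHTML is listed as "Microsoft HTML Object Library".

Is the document to be passed to HtmlAgilityPack now in "sr", so that it just needs to be processed?

What is webBrowser1?

FYI, if you are not running in a WinForms (or any STA) context, you have to start the WebBrowser in an STA container, like this: var t = new Thread(MyThreadStartMethod); t.SetApartmentState(ApartmentState.STA); t.Start();

I have the same problem. I want to get the contents of a table that is loaded dynamically by JS. The div is created by JS and its id is packageTabContainer, but I get null. I tried this solution but did not get the content. Here is the link I need to extract from.

I tried this myself with Selenium last night (albeit with some waiting), and it allowed the JavaScript on the page to update the DOM, and I could access the changes to the DOM from code.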
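The comment about running outside a WinForms (STA) context can be sketched roughly as follows. This is only a minimal illustration: `StaRunner` and `Run` are hypothetical names that do not appear in the original thread, and the sketch assumes Windows, since `SetApartmentState(ApartmentState.STA)` is only supported there. The delegate passed in is where you would create the WebBrowser and do the loading.

```csharp
using System;
using System.Threading;

public static class StaRunner
{
    // Hypothetical helper (not from the original answer): runs `work` on a
    // dedicated STA thread and hands its result back to the caller.
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        var t = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; } // captured and rethrown on the caller's thread
        });

        // The apartment state must be set before Start(); Windows only.
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join(); // block until the work has finished

        if (error != null)
            throw new InvalidOperationException("Work on the STA thread failed.", error);
        return result;
    }
}
```

A quick way to check it is `StaRunner.Run(() => Thread.CurrentThread.GetApartmentState())`, which on Windows should report that the delegate ran on an STA thread. Note that a WebBrowser created this way still needs its messages pumped on that thread, for example by the `Application.DoEvents()` loop in `waitTillLoad`.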