C#: Unable to get a page's HTML
I want to use HttpWebRequest to get the HTML of the following page. Currently I am using:
public static string getHTML(string url)
{
    string responseData = "";
    try
    {
        // System.Threading.Thread.Sleep(1000 * 1);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)";
        request.Timeout = 60000;
        request.AllowAutoRedirect = false;
        request.Method = "GET";
        request.Referer = "inkdispatch.com";
        request.CookieContainer = yummycookies;
        request.KeepAlive = true;
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream responseStream = response.GetResponseStream();
            StreamReader myStreamReader = new StreamReader(responseStream);
            responseData = myStreamReader.ReadToEnd();
        }
        foreach (Cookie cook in response.Cookies)
        {
            yummycookies.Add(cook);
        }
        response.Close();
    }
    catch (Exception e)
    {
        responseData = "An error occurred: " + e.Message;
    }
    return responseData;
}
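But I don't see anything in the response I get. There is no error; it just says "Moved Permanently". When I put the same link in a browser, it works. The link carries a token, which I did obtain from the homepage, but I still have the same problem. Any help is appreciated.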
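The empty result is consistent with how the code above is written: with AllowAutoRedirect set to false, a 301 is handed back to the caller instead of being followed, and because the body is read only when StatusCode is OK, responseData stays empty. A minimal sketch for seeing where the server wants to send the client is shown below. The class and method names (RedirectInspector, ResolveRedirect, GetRedirectTarget) are hypothetical and not part of the original code.

```csharp
using System;
using System.Net;

public static class RedirectInspector
{
    // Returns the absolute redirect target for a 3xx response,
    // or null when the response is not a redirect or has no Location header.
    public static Uri ResolveRedirect(Uri requestUri, HttpStatusCode status, string locationHeader)
    {
        int code = (int)status;
        bool isRedirect = code >= 300 && code < 400;
        if (!isRedirect || string.IsNullOrEmpty(locationHeader))
        {
            return null;
        }
        // Location may be relative (e.g. "/brother?zenid=..."), so resolve it
        // against the URI that was requested.
        return new Uri(requestUri, locationHeader);
    }

    // Issues a single request without following redirects and reports
    // where the server is trying to send the client (hypothetical helper).
    public static Uri GetRedirectTarget(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false;
        request.CookieContainer = cookies;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return ResolveRedirect(request.RequestUri, response.StatusCode, response.Headers["Location"]);
        }
    }
}
```

Logging the value returned by GetRedirectTarget for each hop makes it easier to tell a normal redirect from one that keeps pointing back at the same page.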
Update
I just set:
request.AllowAutoRedirect = true;
But then I got an error:
Too many automatic redirections were attempted.
at System.Net.HttpWebRequest.GetResponse()
at inkdispatchcomScraper.Program.getHTML(String url)
I opened Fiddler, and it shows the same link being requested again and again:
# Result Protocol Host URL Body Caching Content-Type Process Comments Custom
72 301 HTTP inkdispatch.com /brother?zenid=00810c6a184e63149cdca848c7f02871 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
73 301 HTTP inkdispatch.com /brother?zenid=32cf6d38541a90658d39785b6cd64fbc 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
74 301 HTTP inkdispatch.com /brother?zenid=70d0d5eaa10175d74933ba00d47876f8 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
75 301 HTTP inkdispatch.com /brother?zenid=fa45c256a07a9450274269cfa4a4e64a 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
76 301 HTTP inkdispatch.com /brother?zenid=1fb7677a7e6ae0ca32a154ebcc42e043 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
77 301 HTTP inkdispatch.com /brother?zenid=39923f8100276b1c0fa5ccfb1f8d222c 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
78 301 HTTP inkdispatch.com /brother?zenid=fef228719b375ac012c4755793a0027a 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
79 301 HTTP inkdispatch.com /brother?zenid=5c2babf5e6b9b0834f605734441ba208 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
80 301 HTTP inkdispatch.com /brother?zenid=711bdefa3ca7cccebf63b9b8a3734be1 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
81 301 HTTP inkdispatch.com /brother?zenid=c55d1b6166994be1436c9473a1519abe 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
83 301 HTTP inkdispatch.com /brother?zenid=cc66424548f23c3c64b2e0054289283f 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
84 301 HTTP inkdispatch.com /brother?zenid=6f05f06093cd345d10ca729117994ac0 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
85 301 HTTP inkdispatch.com /brother?zenid=4a2ab4d3824c4850f544f28cd71bc1bb 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
86 301 HTTP inkdispatch.com /brother?zenid=6c9d0acd69fc22821014c7e3263da7b6 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
87 301 HTTP inkdispatch.com /brother?zenid=fff05b8df3a1488add36591a2687a830 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
88 301 HTTP inkdispatch.com /brother?zenid=b10facbe8bc9b9a355fe648649067f98 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
89 301 HTTP inkdispatch.com /brother?zenid=8b767c98491178e54d12b4e85ff02b2e 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
90 301 HTTP inkdispatch.com /brother?zenid=9f0b8cb119fee9a4e276bcae5f13772d 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
91 301 HTTP inkdispatch.com /brother?zenid=943076fabf058eb1316cfa86aadb1dec 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
92 301 HTTP inkdispatch.com /brother?zenid=8bd0335032a58b9c399706cd9c695901 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
93 301 HTTP inkdispatch.com /brother?zenid=a1ba5e21f0af2750d398484e063e8303 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
94 301 HTTP inkdispatch.com /brother?zenid=e704b2951b1d136c195fd02ad4abec93 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
95 301 HTTP inkdispatch.com /brother?zenid=6d606d0785f19c17ccb1868577a9d546 0 no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Expires: Thu, 19 Nov 1981 08:52:00 GMT text/html inkdispatchcomscraper.vshost:4612
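In the trace above, every 301 goes to the same path, /brother, and only the zenid value changes. One plausible reading, which is an assumption and not something the trace proves, is that the server does not see the session it expects and therefore issues a fresh zenid on each hop. A small sketch for recognizing this pattern when following redirects by hand compares the requested and target URLs with the session parameter stripped. The names RedirectLoop, WithoutQueryParameter and IsSelfRedirect are hypothetical.

```csharp
using System;
using System.Linq;

public static class RedirectLoop
{
    // Returns a copy of the URI with every occurrence of the named
    // query parameter removed; other parameters are kept in order.
    public static Uri WithoutQueryParameter(Uri uri, string name)
    {
        var kept = uri.Query.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(pair => !pair.StartsWith(name + "=", StringComparison.OrdinalIgnoreCase)
                           && !pair.Equals(name, StringComparison.OrdinalIgnoreCase));
        var builder = new UriBuilder(uri) { Query = string.Join("&", kept) };
        return builder.Uri;
    }

    // True when a redirect points back at the same page and differs
    // only in the session parameter (e.g. "zenid").
    public static bool IsSelfRedirect(Uri requested, Uri target, string sessionParameter)
    {
        return WithoutQueryParameter(requested, sessionParameter)
            .Equals(WithoutQueryParameter(target, sessionParameter));
    }
}
```

If IsSelfRedirect returns true for consecutive hops, following further redirects is unlikely to help, and the headers and cookies being sent are the next thing to compare against what the browser sends.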
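Another update
I see that when I open it in IE, it redirects to /brother, but when the request comes from my code it gets another zenid and is forwarded to that, and this keeps happening.

Set request.AllowAutoRedirect = true
Edit
For your second problem, declare yummycookies as follows:
public static string getHTML(string url)
{
    CookieContainer yummycookies = new CookieContainer();
    ...
}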
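When I tried testing your code it failed, and on another test I got the error "Too many automatic redirections were attempted".
After updating the code and testing again, it worked fine on the URL you provided and the HTML was fetched correctly. The code is here: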
public static string GetHtml2(string urlAddr)
{
    // string.IsNullOrEmpty already covers the null case.
    if (string.IsNullOrEmpty(urlAddr))
    {
        throw new ArgumentNullException("urlAddr");
    }

    // 1. Create the request object.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddr);
    //request.AllowAutoRedirect = true;
    //request.MaximumAutomaticRedirections = 200;
    request.Proxy = null;
    request.UseDefaultCredentials = true;

    // 2. Create a cookie container for this request.
    CookieContainer cc = new CookieContainer();

    // 3. A cookie container must be assigned for the request to pick up cookies.
    request.CookieContainer = cc;

    // The using blocks dispose the response and the reader, so no explicit Close is needed.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}
Hope this works.
But it works fine on my PC. (I just had to declare static CookieContainer yummycookies = new CookieContainer();)
Which link are you hitting? It works perfectly for that one, but not for any other link.
No other links? Are you sure? It works for me with http://inkdispatch.com/brother, http://www.google.com, http://stackoverflow.com and so on.
What I mean is that no other link on the site works: every link except the homepage requests itself again and again, each time with a new link, until I get the error mentioned above. Only the homepage returns HTML fine. :( Any suggestions? Could they be checking IPs to block some countries? But then I can see the HTML in IE, so it is very confusing.
Thanks, same problem. I think it may be some internet setting on my side.
It worked when I did this:
    System.Net.ServicePointManager.Expect100Continue = false;
    WebHeaderCollection myWebHeaderCollection = request.Headers;
    // Add the Accept-Language header to the request.
    myWebHeaderCollection.Add("Accept-Language: en-US");
    myWebHeaderCollection.Add("Accept-Encoding", "gzip, deflate");
    myWebHeaderCollection.Add("Cookie", "zenid=9ea4d211ba2aa64cbaa148df5de4ab10");
In most cases live scraping will not work if "Accept-Encoding", "gzip, deflate" is enabled, especially when the target site uses that encoding.
Yes, I normally don't use that encoding at all, but in this case I had to replicate everything the browser sends; only then did it work.
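On the Accept-Encoding point: if the header is added by hand, the response body may arrive compressed and ReadToEnd will return unreadable bytes. A hedged alternative is to let HttpWebRequest negotiate and decompress gzip/deflate itself through the AutomaticDecompression property. The sketch below shows only that part; the CompressedRequest class and its Create method are hypothetical names, and whether this alone is enough for inkdispatch.com is not confirmed by the thread.

```csharp
using System;
using System.Net;

public static class CompressedRequest
{
    // Builds a request that asks for gzip/deflate and transparently
    // decompresses the response stream when it is read.
    public static HttpWebRequest Create(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        // Replaces a hand-written "Accept-Encoding: gzip, deflate" header.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        request.Headers.Add("Accept-Language", "en-US");
        return request;
    }
}
```

The response can then be read with a StreamReader exactly as in GetHtml2, without any explicit GZipStream handling.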