C#: Need help understanding a confusing result while making a web crawler count all links on a page

c# regex web-crawler

I have started writing a web crawler. It was going well until I got confused. I wrote the following code, and I am passing http://www.google.com as the string URL:

public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
} 

private string getURLContent(string URL)
{
    string content;
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();

    StreamReader reader = new StreamReader(stream);
    content = reader.ReadToEnd();

    reader.Close();
    stream.Close();
    return content;
}

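Problem:

I am trying to get all the links on that page (http://www.google.com or any other website), but the regex match gives me fewer links than I expect. When I manually search the page source for the string "href=", I get a link count of 19. I don't understand why the count from the code is lower.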

I fixed and tested your regex pattern. The following should work more effectively. It gets 11 matches from google.ca:

public void crawlURL(string URL)
{
    PageContent = getURLContent(URL);

    // Match absolute http/https URLs inside double-quoted href attributes.
    MatchCollection matches = Regex.Matches(PageContent, "(href=\"https?://[a-z0-9-._~:/?#\\[\\]@!$&'()*+,;=]+(?=\"|$))", RegexOptions.IgnoreCase);
    foreach (Match match in matches)
        Console.WriteLine(match.Value);

    int count = matches.Count;
}

private string getURLContent(string URL)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";

    // The using blocks dispose the response, stream, and reader even if reading throws.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
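Note that the pattern in this answer only matches absolute http/https URLs in double quotes, so relative links such as "/about" are not counted. As a rough sketch (not part of the original answer), relative values could be captured too and resolved against the page URL with System.Uri. The class name LinkResolver and the method ExtractAbsoluteLinks are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class LinkResolver
{
    // Captures the value of any double-quoted href, absolute or relative.
    private static readonly Regex HrefPattern =
        new Regex("href=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Returns each double-quoted href in html as an absolute http(s) URL.
    public static List<string> ExtractAbsoluteLinks(string html, string pageUrl)
    {
        var baseUri = new Uri(pageUrl);
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            // Uri.TryCreate resolves relative values against the page URL.
            // Non-web schemes such as javascript: are skipped.
            if (Uri.TryCreate(baseUri, match.Groups[1].Value, out Uri resolved)
                && (resolved.Scheme == Uri.UriSchemeHttp || resolved.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(resolved.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Calling LinkResolver.ExtractAbsoluteLinks(PageContent, URL) inside crawlURL would give a list whose Count includes relative links. Single-quoted and unquoted attributes are still missed by this sketch.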

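"Manually checked the source code": do you mean the contents of PageContent, or the page as opened in a browser? In the latter case you may get a different page because of personalization.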
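One way to rule out the browser difference is to run both counts on the same downloaded string. The sketch below is only an illustration under that assumption; HrefDiagnostics and its method names are hypothetical. It compares the number of literal "href=" occurrences with the number of matches of the question's pattern, which also requires a double quote.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical diagnostic helper for comparing the two counts.
public static class HrefDiagnostics
{
    // Counts non-overlapping, case-insensitive occurrences of a literal string.
    public static int CountLiteral(string text, string literal)
    {
        if (string.IsNullOrEmpty(literal))
            throw new ArgumentException("literal must not be empty", nameof(literal));

        int count = 0;
        int index = text.IndexOf(literal, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            count++;
            index = text.IndexOf(literal, index + literal.Length, StringComparison.OrdinalIgnoreCase);
        }
        return count;
    }

    // Counts matches of the question's pattern: href= followed by a double quote.
    public static int CountDoubleQuoted(string text)
    {
        return Regex.Matches(text, "href=\"", RegexOptions.IgnoreCase).Count;
    }
}
```

If CountLiteral(PageContent, "href=") is larger than CountDoubleQuoted(PageContent), the missing occurrences use single quotes, no quotes, or appear in script code rather than in attributes.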

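HttpWebRequest does not execute JavaScript, so any links added by script will not show up.

Avoid using regex; use a parser. See similar questions.

Also, you are searching for href=", but on Google you will find a lot of JavaScript containing a.href=document…, which the regex will not match. (It will also ignore href=' and so on.)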
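To make the last point concrete, here is a minimal sketch of a more tolerant pattern that accepts double-quoted, single-quoted, and unquoted href values, but only inside an <a ...> tag, so script assignments like a.href=document… are not counted. TolerantHrefMatcher and ExtractHrefs are hypothetical names. This is a stopgap rather than a substitute for the HTML parser recommended above, and it can still be fooled by markup inside script strings or comments.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating a quote-tolerant href pattern.
public static class TolerantHrefMatcher
{
    // Group 1: double-quoted value, group 2: single-quoted value, group 3: unquoted value.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>""']+))",
        RegexOptions.IgnoreCase);

    // Returns the raw href value of every <a> tag found in html.
    public static List<string> ExtractHrefs(string html)
    {
        var values = new List<string>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            // Exactly one of the three alternatives matched; take that one.
            for (int g = 1; g <= 3; g++)
            {
                if (match.Groups[g].Success)
                {
                    values.Add(match.Groups[g].Value);
                    break;
                }
            }
        }
        return values;
    }
}
```

TolerantHrefMatcher.ExtractHrefs(PageContent).Count would then give a link count that does not depend on the quoting style.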