Web 在网页上抓取用户名的好方法_Web_Scrape

Web 在网页上抓取用户名的好方法

web

Web 在网页上抓取用户名的好方法,web,scrape,Web,Scrape,我想从youtube评论中删除用户名，如页面中：我想获得所有用户名/显示名，如：“fedfields”、“mystik dread” 以及相应的链接（当您单击“fedfields”时，它将链接到其配置文件）我想使用自动化bash脚本来删除它们我有以下问题： 1我最初的方法是编写自动脚本，使用wget下载页面，然后使用regex处理页面以获得这些名称，但这样，我需要下载整个页面，每个页面都有几MB，如果我下载很多页面，它会占用很多空间，有没有更好的方法 2有很多页面，比如在链接中，有7个页

我想从youtube评论中删除用户名，如页面中：

我想获得所有用户名/显示名，如：“fedfields”、“mystik dread” 以及相应的链接（当您单击“fedfields”时，它将链接到其配置文件）我想使用自动化bash脚本来删除它们我有以下问题：

1我最初的方法是编写自动脚本，使用wget下载页面，然后使用regex处理页面以获得这些名称，但这样，我需要下载整个页面，每个页面都有几MB，如果我下载很多页面，它会占用很多空间，有没有更好的方法

2有很多页面，比如在链接中，有7个页面，是否可以在一个页面中获取所有页面？

您可以在C#应用程序中使用HtmlAgilityPack

        HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
        IEnumerable<HtmlNode> userNames = doc.DocumentNode.Descendants("a").Where(
            d => d.Attributes.Contains("class") &&   
            d.Attributes["class"].Value.Contains("yt-user-name"));

然后，您可以通过HTMLAgilityPack读取流并获取用户名。

您可以在C#应用程序中使用HTMLAgilityPack

        HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
        IEnumerable<HtmlNode> userNames = doc.DocumentNode.Descendants("a").Where(
            d => d.Attributes.Contains("class") &&   
            d.Attributes["class"].Value.Contains("yt-user-name"));

然后，您可以通过HTMLAgilityPack读取流并获取用户名。

使用mashape上的Scrapegat将所有用户名作为json对象返回：）

在mashape上使用scrapegat将所有用户名作为json对象返回：）

这样做：

import re
import sys
import time
import urllib2

html = True

argv_list = sys.argv
if len(argv_list) == 2:
    vid = argv_list[1]
else:
    vid = "mIA0W69U2_Y"

regex = re.compile("<span class=\"author.*?<a href=\"(.*?)\".*? dir=\"ltr\">(.*?)</a>", re.DOTALL | re.UNICODE | re.IGNORECASE)

index = 1
author_lists = []
t1 = time.time()
print "######################### Start #########################"

while 1:
    url = "http://www.youtube.com/watch_ajax?action_get_comments=1&v="+vid+"&commenttype=everything&source=w&page_size=500&p="+str(index)+"&format=XML"
    print "Retrieving page "+str(index)+": ", url
    o = urllib2.urlopen(url)
    r = o.read()
    elements = regex.findall(r)
    author_list = []
    for x, y in elements:

        if x.startswith("http://") or x.startswith("https://"):
            continue
        xx = "".join(["http://www.youtube.com", x])
        href = xx.strip()
        #print href


        if "</span>" not in y :
            uname = y.strip()
        else:
            uname = y.split("</span>")[0].strip()

        if uname.startswith("<a"):
            continue

        if not uname or not href:
            continue

        if html:
            #1 output html
            author = "".join(["<a href=\"", href, "\">", uname, "</a>"])
        else:
            #2 output txt
            author = " ".join([uname, href])

        author_list.append(author)

    t = "%02d:%02d:%02d" % reduce(lambda ll,b : divmod(ll[0],b) + ll[1:], [(time.time()-t1,),60,60])
    print "".join(["Time passed: ", t])
    if not author_list:
        break
    else:
        author_lists.extend(author_list)
    index+=1
    #break #uncomment it if you only want to test one page

print "######################### Finished #########################"
print "Total comments: ", len(author_lists)
if author_lists:
    author_lists.sort()
    last = author_lists[-1]
    for i in range(len(author_lists)-2, -1, -1):
        if last == author_lists[i]:
            del author_lists[i]
        else:
            last = author_lists[i]
    if html:
        authors = "<br>".join(author_lists)
        authors = "".join(["<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><body>", authors, "</body></html>"])
        fname = vid+".html"
    else:
        authors = "\n".join(author_lists)
        fname = vid+".txt"

    #print "Authors: ", authors
    print "Total commenters: ", len(author_lists)



    oo = open(fname, "w")
    oo.write(authors)
    oo.close()
print "######################### Exist #########################"

重新导入
导入系统
导入时间
导入urllib2
html=True
argv_list=sys.argv
如果len（argv_列表）==2：
vid=argv_列表[1]
其他：
vid=“mIA0W69U2_Y”
regex=re.compile（“执行此操作：
import re
import sys
import time
import urllib2

html = True

argv_list = sys.argv
if len(argv_list) == 2:
    vid = argv_list[1]
else:
    vid = "mIA0W69U2_Y"

regex = re.compile("<span class=\"author.*?<a href=\"(.*?)\".*? dir=\"ltr\">(.*?)</a>", re.DOTALL | re.UNICODE | re.IGNORECASE)

index = 1
author_lists = []
t1 = time.time()
print "######################### Start #########################"

while 1:
    url = "http://www.youtube.com/watch_ajax?action_get_comments=1&v="+vid+"&commenttype=everything&source=w&page_size=500&p="+str(index)+"&format=XML"
    print "Retrieving page "+str(index)+": ", url
    o = urllib2.urlopen(url)
    r = o.read()
    elements = regex.findall(r)
    author_list = []
    for x, y in elements:

        if x.startswith("http://") or x.startswith("https://"):
            continue
        xx = "".join(["http://www.youtube.com", x])
        href = xx.strip()
        #print href


        if "</span>" not in y :
            uname = y.strip()
        else:
            uname = y.split("</span>")[0].strip()

        if uname.startswith("<a"):
            continue

        if not uname or not href:
            continue

        if html:
            #1 output html
            author = "".join(["<a href=\"", href, "\">", uname, "</a>"])
        else:
            #2 output txt
            author = " ".join([uname, href])

        author_list.append(author)

    t = "%02d:%02d:%02d" % reduce(lambda ll,b : divmod(ll[0],b) + ll[1:], [(time.time()-t1,),60,60])
    print "".join(["Time passed: ", t])
    if not author_list:
        break
    else:
        author_lists.extend(author_list)
    index+=1
    #break #uncomment it if you only want to test one page

print "######################### Finished #########################"
print "Total comments: ", len(author_lists)
if author_lists:
    author_lists.sort()
    last = author_lists[-1]
    for i in range(len(author_lists)-2, -1, -1):
        if last == author_lists[i]:
            del author_lists[i]
        else:
            last = author_lists[i]
    if html:
        authors = "<br>".join(author_lists)
        authors = "".join(["<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><body>", authors, "</body></html>"])
        fname = vid+".html"
    else:
        authors = "\n".join(author_lists)
        fname = vid+".txt"

    #print "Authors: ", authors
    print "Total commenters: ", len(author_lists)



    oo = open(fname, "w")
    oo.write(authors)
    oo.close()
print "######################### Exist #########################"

重新导入
导入系统
导入时间
导入urllib2
html=True
argv_list=sys.argv
如果len（argv_列表）==2：
vid=argv_列表[1]
其他：
vid=“mIA0W69U2_Y”
regex=re.compile（“C#也有助于此（尽管HAP和WebRequest更好）：
C#也可以通过这种方式提供帮助（尽管HAP和WebRequest更好）：
为名称字段和其他要提取的字段编写rssfeeds使用自动插件设置爬虫程序按照以下步骤编写rssfeeds为名称字段和其他要提取的字段编写rssfeeds使用自动插件设置爬虫程序按照以下步骤编写rssfeedsnd gems nokogiri和开放uri
require 'nokogiri'
require 'open-uri'
url="https://www.youtube.com/all_comments?v=mIA0W69U2_Y"
dom=Nokogiri::HTML(open(url))
dom.xpath("//div[@class='comment-entry']").each do |comment|
  username=comment.xpath(".//a[contains(@class,'user-name')]").first
  username=username.content.chomp.strip if username
  profilelink=comment.xpath(".//a[contains(@class,'user-name')]/@href").first
  profilelink=profilelink.content.chomp.strip if profilelink
  profilelink="http://www.youtube.com"+profilelink if profilelink.match(/^\//)
  puts "#{username} #{profilelink}" if username and profilelink
end

有关更多信息，请访问
以下是使用ruby和gems nokogiri以及开放uri的简单解决方案
require 'nokogiri'
require 'open-uri'
url="https://www.youtube.com/all_comments?v=mIA0W69U2_Y"
dom=Nokogiri::HTML(open(url))
dom.xpath("//div[@class='comment-entry']").each do |comment|
  username=comment.xpath(".//a[contains(@class,'user-name')]").first
  username=username.content.chomp.strip if username
  profilelink=comment.xpath(".//a[contains(@class,'user-name')]/@href").first
  profilelink=profilelink.content.chomp.strip if profilelink
  profilelink="http://www.youtube.com"+profilelink if profilelink.match(/^\//)
  puts "#{username} #{profilelink}" if username and profilelink
end

欲了解更多信息，请访问你将使用什么语言？无论如何，你有减小尺寸的想法。你将使用什么语言？无论如何，你有减小尺寸的想法。是的，我是API的作者：）是的，我是API的作者：）
require 'nokogiri'
require 'open-uri'
url="https://www.youtube.com/all_comments?v=mIA0W69U2_Y"
dom=Nokogiri::HTML(open(url))
dom.xpath("//div[@class='comment-entry']").each do |comment|
  username=comment.xpath(".//a[contains(@class,'user-name')]").first
  username=username.content.chomp.strip if username
  profilelink=comment.xpath(".//a[contains(@class,'user-name')]/@href").first
  profilelink=profilelink.content.chomp.strip if profilelink
  profilelink="http://www.youtube.com"+profilelink if profilelink.match(/^\//)
  puts "#{username} #{profilelink}" if username and profilelink
end