Ruby: how to scrape a site every hour with Nokogiri
ruby, web-scraping, nokogiri

Below is the scraper code I wrote. I need help adding a delay to this scraper: I want to scrape one page every hour.
require 'open-uri'
require 'json'      # needed for JSON.parse below
require 'nokogiri'
require 'sanitize'

class Scraper
  def initialize(url_to_scrape)
    @url = url_to_scrape
  end

  def scrape
    puts "Initiating scrape..."
    # The endpoint returns JSON; the page markup lives under the "html" key.
    # (Previously this was: page = Nokogiri::HTML(open(@url)))
    raw_response = open(@url)
    json_response = JSON.parse(raw_response.read)
    page = Nokogiri::HTML(json_response["html"])

    # Parse the <a> tags with the class "article_title" and build the
    # links array from each href in these article_title links.
    puts "Scraping links..."
    links = page.css(".article_title")
    articles = []

    # Limit the number of links to scrape during the testing phase.
    puts "Building articles collection..."
    links.each do |link|
      article_url = "http://seekingalpha.com" + link["href"]
      article_page = Nokogiri::HTML(open(article_url))
      article = {}
      article[:company] = article_page.css("#about_primary_stocks").css("a")
      article[:content] = article_page.css("#article_content")
      article[:content] = Sanitize.clean(article[:content].to_s)
      # blank? comes from ActiveSupport (this class runs inside a Rails app).
      articles << article unless article[:content].blank?
    end

    puts "Clearing all existing transcripts..."
    Transcript.destroy_all

    # Iterate over the articles collection and save each record to the database.
    puts "Saving new transcripts..."
    articles.each do |article|
      transcript = Transcript.new
      transcript.stock_symbol = article[:company].text.to_s
      transcript.content = article[:content].to_s
      transcript.save
    end
    # return articles
  end
end
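For reference, the JSON step above can be exercised on its own with the stdlib `JSON` library; the sample payload below is made up for illustration, shaped like the `{"html": "..."}` response the scraper expects:

```ruby
require 'json'

# Hypothetical payload mimicking the endpoint's {"html": "..."} response.
raw = '{"html": "<div><a class=\"article_title\" href=\"/article/1\">Q1 call</a></div>"}'

json_response = JSON.parse(raw)
html = json_response["html"]
# In the scraper this string is then handed to Nokogiri::HTML(html)
# and queried with .css(".article_title").
puts html.include?("article_title")  # => true
```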
So, what will you do with the articles array once the scrape is finished?

I'm not sure if it's what you're looking for, but I would use it to schedule this script to run every hour. If your script is part of a larger application, there is a neat gem that provides a Ruby wrapper for cron tasks.

Hope that helps!
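If an external scheduler like cron isn't an option, the simplest in-process approach is a plain Ruby loop that sleeps between runs. A minimal sketch, assuming the `Scraper` class above; `run_periodically` and its `iterations:` option are hypothetical names, not from the original post:

```ruby
# Run the given block repeatedly, sleeping `interval_seconds` between runs.
# `iterations:` caps the number of runs (useful for testing); it defaults
# to running forever, which is what a scrape-every-hour script wants.
def run_periodically(interval_seconds, iterations: Float::INFINITY)
  results = []
  count = 0
  while count < iterations
    results << yield
    count += 1
    sleep(interval_seconds) if count < iterations
  end
  results
end

# Production usage (hypothetical URL), one scrape per hour:
# run_periodically(3600) { Scraper.new("http://seekingalpha.com/...").scrape }
```

Note that `sleep 3600` measures the gap *between* scrapes, not wall-clock hours; if each scrape takes a few minutes, runs will slowly drift later. Cron (or a cron-wrapper gem, as the answer suggests) avoids that drift.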