Ruby: how to scrape a site every hour with Nokogiri
ruby, web-scraping, nokogiri

Below is the scraper code I wrote. I need help adding a delay to this scraper: I want to scrape one page every hour.
require 'open-uri'
require 'json'      # needed for JSON.parse below
require 'nokogiri'
require 'sanitize'

class Scraper
  def initialize(url_to_scrape)
    @url = url_to_scrape
  end

  def scrape
    puts "Initiating scrape..."
    # The endpoint returns JSON; the page markup lives under the "html" key.
    # (Previously this was: page = Nokogiri::HTML(open(@url)))
    raw_response = open(@url)
    json_response = JSON.parse(raw_response.read)
    page = Nokogiri::HTML(json_response["html"])

    # Parse the <a> tags with the class "article_title" and build the
    # links array from each href in these article_title links.
    puts "Scraping links..."
    links = page.css(".article_title")
    articles = []

    # Limit the number of links to scrape during the testing phase.
    puts "Building articles collection..."
    links.each do |link|
      article_url = "http://seekingalpha.com" + link["href"]
      article_page = Nokogiri::HTML(open(article_url))
      article = {}
      article[:company] = article_page.css("#about_primary_stocks").css("a")
      article[:content] = article_page.css("#article_content")
      article[:content] = Sanitize.clean(article[:content].to_s)
      # blank? comes from ActiveSupport (this class runs inside a Rails app).
      articles << article unless article[:content].blank?
    end

    puts "Clearing all existing transcripts..."
    Transcript.destroy_all

    # Iterate over the articles collection and save each record to the database.
    puts "Saving new transcripts..."
    articles.each do |article|
      transcript = Transcript.new
      transcript.stock_symbol = article[:company].text.to_s
      transcript.content = article[:content].to_s
      transcript.save
    end
    # return articles
  end
end
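For reference, the JSON step above can be exercised on its own with the stdlib `JSON` library; the sample payload below is made up for illustration, shaped like the `{"html": "..."}` response the scraper expects:

```ruby
require 'json'

# Hypothetical payload mimicking the endpoint's {"html": "..."} response.
raw = '{"html": "<div><a class=\"article_title\" href=\"/article/1\">Q1 call</a></div>"}'

json_response = JSON.parse(raw)
html = json_response["html"]
# In the scraper this string is then handed to Nokogiri::HTML(html)
# and queried with .css(".article_title").
puts html.include?("article_title")  # => true
```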
So, what will you do with the articles array once the scrape is finished?

I'm not sure if it's what you're looking for, but I would use it to schedule this script to run every hour. If your script is part of a larger application, there is a neat gem that provides a Ruby wrapper for cron tasks.

Hope that helps!
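If an external scheduler like cron isn't an option, the simplest in-process approach is a plain Ruby loop that sleeps between runs. A minimal sketch, assuming the `Scraper` class above; `run_periodically` and its `iterations:` option are hypothetical names, not from the original post:

```ruby
# Run the given block repeatedly, sleeping `interval_seconds` between runs.
# `iterations:` caps the number of runs (useful for testing); it defaults
# to running forever, which is what a scrape-every-hour script wants.
def run_periodically(interval_seconds, iterations: Float::INFINITY)
  results = []
  count = 0
  while count < iterations
    results << yield
    count += 1
    sleep(interval_seconds) if count < iterations
  end
  results
end

# Production usage (hypothetical URL), one scrape per hour:
# run_periodically(3600) { Scraper.new("http://seekingalpha.com/...").scrape }
```

Note that `sleep 3600` measures the gap *between* scrapes, not wall-clock hours; if each scrape takes a few minutes, runs will slowly drift later. Cron (or a cron-wrapper gem, as the answer suggests) avoids that drift.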