Nokogiri Ruby gem does not work for all URLs

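I'm building my own web scraper for a final project and need some help.

I'm using Nokogiri. The scraper finds all the words on a web page, counts the frequency of each word using a dictionary (a hash), and then returns the top ten words on the site. I can pass in as many sites as I want and it will still work, so I can pass in http://fox.com, http://cnbc.com, and so on. The program works fine on those sites, but for some sites I get an error. For example, http://facebook doesn't work; it says "redirection forbidden".

Here is my code so far: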

require 'rubygems'
require 'nokogiri'
require 'open-uri'

class Scraper

  attr_accessor :url, :words, :arguments

  def initialize(*args)
    @words = Hash.new("No Match Found")
    @arguments = args
    compiler
    print_results
  end

  def mechansim(site)
    boring_words = ["the","to", "in","if","of","all","and","the","for","news","is","on","a","this","with","at","continue","more","be","from","could","as","by","he","she","who","what","not",
      "newswidget","newswidgetfooter","pm"]
    page = Nokogiri::HTML(open(site))
    page.search('script').each {|el| el.unlink}
    links = page.css('body').inner_text.downcase.gsub(/[^0-9a-z ]/i, '').split(' ')
    links.each do |x|
      if @words.has_key?(x) === true && boring_words.include?(x) === false
        @words[x] += 1
      else 
        @words[x] =1
      end
    end
    if @arguments[0].length > 0
      compiler
    end
  end

  def compiler
    @arguments.each do |argument| 
      argument = argument[0]
      site = argument
      arguments[0].shift
      mechansim(site)
    end
  end


  def print_results
    puts "------------------------------------------------------------------"
    @words = @words.sort_by {|k, v| v}.reverse.to_h 
    print @words.take(20)
    puts "------------------------------------------------------------------"
  end

end

Scraper.new(["http://foxnews.com"])

Use the HTTPS version of the Facebook URL:

https://facebook.com
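
A minimal sketch of that suggestion, using only the Ruby standard library. The helper names to_https and fetch_html are hypothetical and not part of the question's code. It assumes the site serves the same page over HTTPS, and it uses URI.open, which is how open-uri is called on recent Ruby versions.

```ruby
require 'uri'
require 'open-uri'

# Hypothetical helper: rewrite a leading "http://" to "https://".
# URLs that already use https (or any other scheme) are returned unchanged.
def to_https(url)
  url.sub(%r{\Ahttp://}i, 'https://')
end

# Hypothetical helper: download the page body over HTTPS as a String.
# The result can be handed to Nokogiri::HTML in place of open(site).
def fetch_html(url)
  URI.open(to_https(url), &:read)
end

# Example usage (performs a network request, so it is left commented out):
#   html = fetch_html('http://facebook.com')
#   page = Nokogiri::HTML(html)
```

Requesting the HTTPS address directly means open-uri never sees a redirect from http to https, which is the redirect the error message complains about.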

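Some sites don't like being scraped. Funny that :)

This has nothing to do with Nokogiri, because it does not retrieve content from a site; it only parses what is passed to it. You are using OpenURI to read the site and passing it an incomplete URL. Facebook is not http://facebook, it is http://facebook.com. If a URL is missing its TLD you could write code to complete it, but the result is often inaccurate or misleading, and your code could be redirected somewhere you don't want to go. Given a valid URL, OpenURI will follow redirects.

HTTP is not https. Please clarify what you mean. The problem is not https versus http; it is that the OP is missing the TLD.

If you run the code, you will see that open-uri cannot handle the redirect from http://facebook.com. The solution is to use the redirect target.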
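
The two points raised in the comments (reject a URL with no TLD rather than guess at it, and request the redirect target yourself) can be sketched with Net::HTTP from the standard library. This is only an illustration under stated assumptions: complete_host?, redirect_target and fetch_following_redirects are hypothetical names, the "host contains a dot" check is a rough heuristic that would also reject hosts such as localhost, and the limit of 5 redirects is arbitrary.

```ruby
require 'uri'
require 'net/http'

# Hypothetical guard: true only for http(s) URLs whose host contains a dot,
# so "http://facebook" is rejected instead of being silently completed.
def complete_host?(url)
  uri = URI.parse(url)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && uri.host.include?('.')
rescue URI::InvalidURIError
  false
end

# Resolve a Location header against the URL that produced it,
# because servers may send relative redirect targets.
def redirect_target(current_url, location)
  URI.join(current_url, location).to_s
end

# Hypothetical fetcher: request the URL and, on a redirect response,
# request the redirect target, up to `limit` requests in total.
def fetch_following_redirects(url, limit: 5)
  raise ArgumentError, "incomplete URL: #{url}" unless complete_host?(url)

  limit.times do
    response = Net::HTTP.get_response(URI.parse(url))
    case response
    when Net::HTTPSuccess
      return response.body
    when Net::HTTPRedirection
      location = response['location']
      raise "redirect from #{url} has no Location header" if location.nil?
      url = redirect_target(url, location)
    else
      raise "request to #{url} failed with status #{response.code}"
    end
  end
  raise "gave up after #{limit} requests"
end

# Example usage (performs network requests, so it is left commented out):
#   html = fetch_following_redirects('http://facebook.com')
#   page = Nokogiri::HTML(html)
```

Because each hop is requested explicitly, a redirect from http to https is followed like any other, and the request limit keeps a redirect loop from running forever.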