Ruby: scraping problem with Nokogiri


I'm trying to write a simple script that tells me when the next episode of show X will be released.

Here is what I have so far:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))

puts doc.at_css('h1').text
airdate =  doc.at_css('.highlight_date span , h1').text
date = /\W/.match(airdate)
puts date
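When I run this, all it returns is: Game of Thrones

The CSS selector I use on the airdate line gives text in the form /xx/xx/xx, but I only want the date itself, so I used /\W/, although I may be completely wrong here.

So basically I want it to print only the show title and the date of the next episode.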
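To illustrate why `/\W/` does not pull out a date: `\W` matches a single non-word character, so the match is one character long at most. The sketch below compares it with a pattern that matches a whole month/day/year date. The sample string and the helper names are assumptions made for illustration, not text taken from the page. Also note that a selector group such as `'.highlight_date span , h1'` passed to `at_css` returns only the first matching element, which is likely the `h1` here.

```ruby
# Hypothetical helper: returns the first non-word character, which is all /\W/ can match.
def first_non_word(text)
  text[/\W/]
end

# Hypothetical helper: returns the first m/d/yy or m/d/yyyy date found in the text, or nil.
def extract_date(text)
  text[/\d{1,2}\/\d{1,2}\/\d{2,4}/]
end

# The sample text is an assumption; the real element text may differ.
sample = "Next episode airs 5/18/2014"

puts first_non_word(sample).inspect # a single space
puts extract_date(sample)           # 5/18/2014
```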

You can do the following:

require 'nokogiri'
require 'open-uri'

url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer handles URLs on Ruby 3.0+

# Season 4 currently lists 7 episodes; this number may change later.
doc.css('#season-4-eps > li').size # => 7

# Collect the title and air date of each season 4 episode.
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text, node.css('.date').text] }
# => [["Mockingbird", "5/18/14"],
#     ["The Laws of God and Men", "5/11/14"],
#     ["First of His Name", "5/4/14"],
#     ["Oathkeeper", "4/27/14"],
#     ["Breaker of Chains", "4/20/14"],
#     ["The Lion and the Rose", "4/13/14"],
#     ["Two Swords", "4/6/14"]]
Looking at the page again, I can see that it always opens with the latest season's data. So the code above can be modified as follows:

require 'nokogiri'
require 'open-uri'

url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer handles URLs on Ruby 3.0+

# Count how many seasons are listed; this is assumed to equal the latest season number.
latest_session = doc.css(".filters > li[data-season]").size # => 4

# Print the title and air date of each episode in the latest season.
doc.css("#season-#{latest_session}-eps > li").each do |node|
  p [node.css('.title').text, node.css('.date').text]
end
# >> ["The Mountain and the Viper", "6/1/14"]
# >> ["Mockingbird", "5/18/14"]
# >> ["The Laws of God and Men", "5/11/14"]
# >> ["First of His Name", "5/4/14"]
# >> ["Oathkeeper", "4/27/14"]
# >> ["Breaker of Chains", "4/20/14"]
# >> ["The Lion and the Rose", "4/13/14"]
# >> ["Two Swords", "4/6/14"]
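Once the `[title, date]` pairs are collected, picking the next episode is a matter of parsing the dates and taking the earliest one that is not in the past. Here is a minimal sketch using only the standard library; `parse_airdate` and `next_episode` are hypothetical helpers, and the `%m/%d/%y` format is an assumption based on the dates shown in the output above.

```ruby
require 'date'

# Hypothetical helper: parses dates such as "5/18/14" (month/day/two-digit year).
def parse_airdate(text)
  Date.strptime(text, '%m/%d/%y')
end

# Hypothetical helper: returns the earliest [title, Date] pair on or after
# `today`, or nil when every episode has already aired.
def next_episode(episodes, today = Date.today)
  episodes
    .map { |title, date| [title, parse_airdate(date)] }
    .select { |_, date| date >= today }
    .min_by { |_, date| date }
end

episodes = [
  ["The Mountain and the Viper", "6/1/14"],
  ["Mockingbird", "5/18/14"],
  ["The Laws of God and Men", "5/11/14"]
]

title, date = next_episode(episodes, Date.new(2014, 5, 12))
puts "#{title}: #{date}" # Mockingbird: 2014-05-18
```

Passing `today` explicitly keeps the example reproducible; in a real script you would leave it at the default.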

Based on the comments, the OP may be interested in getting the data from the Next Episode box on the page. Here is one way to do that:

require 'nokogiri'
require 'open-uri'

url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer handles URLs on Ruby 3.0+

hash = {}
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node|
  hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
  hash['title'] = node.css('div.highlight_name > a').text
end

hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"}
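Worth reading: Object#tap

Yields x to the block, and then returns x. The primary purpose of this method is to "tap into" a method chain, in order to perform operations on intermediate results within the chain.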
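A small standard-library illustration of that behavior: `tap` passes the intermediate value to the block and then returns the same value, so the chain continues as if the block were not there. The method name `sorted_and_scaled` is made up for this sketch.

```ruby
# Hypothetical helper: sorts the values, peeks at the intermediate result
# with tap, then continues the chain with map.
def sorted_and_scaled(values)
  values
    .sort
    .tap { |sorted| puts "after sort: #{sorted.inspect}" } # block result is ignored
    .map { |n| n * 10 }
end

p sorted_and_scaled([3, 1, 2])
# after sort: [1, 2, 3]
# [10, 20, 30]
```

This is also why the answer's code fills `hash` inside the block: `tap` discards the block's return value, so the work has to happen as a side effect.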


Also, read up on CSS selectors to understand how they work with the #css method.
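Comments:

Thanks for your answer! However, how would I change it so that it only collects the latest episode, indefinitely (so that it moves from season 4 to season 5 automatically, without changing the code)?

I don't know how the code above works, but you can use the .next_episode class to get the HTML for the latest episode; the relevant information is inside it.

@HarryLucas OK. The data on the page is loaded via AJAX, and Nokogiri doesn't support AJAX. To do this, you can use Selenium WebDriver to execute the AJAX calls, then use Nokogiri to parse the HTML page and get the data from it.

@amitamb That works a bit better. What I have now is

puts doc.at_css('h1').text
airdate = doc.at_css('.next_episode').text
puts airdate.scan(/\d+/)

which puts the numbers on separate lines (5, then 18 on the next line, then 2014 on the one after). How can I put them on the same line?

@HarryLucas What is the problem with this code of mine?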
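On the last question: `puts` prints each element of an array on its own line, and `scan` returns an array, which is why the numbers end up on separate lines. One way around it, sketched here with a made-up sample string and a hypothetical helper name, is to join the pieces before printing.

```ruby
# Hypothetical helper: joins every run of digits in the text with "/".
def airdate_on_one_line(text)
  text.scan(/\d+/).join('/')
end

# The sample text is an assumption; the real element text may differ.
puts airdate_on_one_line("Next episode airs 5/18/2014") # prints 5/18/2014
```

Keep in mind that this picks up every number in the text, so it only works if the date is the only numeric content in the element.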