Ruby: accessing the most recent table row and its data

Tags: ruby, web-scraping, nokogiri

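I'm building a small app based on the result of the most recent game, i.e. the game data in the last row (win/loss and game number).

My problem is accessing the first column of the last row (the most recently played game). How can this be done?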

require 'open-uri'

class BrooklynPizzaController < ApplicationController

  def index
    # URL for dynamic content
    url = "http://www.basketball-reference.com/teams/BRK/2015_games.html"

    # Open URL using nokogiri
    doc = Nokogiri::HTML(open(url))

    # Scrape result from Web site
    @result = doc.css("#teams_games").xpath("//table/tbody/tr/td[8]/text()")

    # IN PROGRESS - Get date of last game played
    @result_date = doc.xpath('//table/tbody/tr/td[2]/a/text()') do |link|
      @result_date[link.text.strip] = link['a']
    end


    ###############################################################
    # IN PROGRESS - Get number of last game played from 1st column
    # doc.xpath('//table/tbody/tr/td[1]/text()') do |game|
    #   last_game_number = 
    # end
    ################################################################

    # @result_date = doc.css("#teams_games").xpath("//table/tbody/tr/td[2]/text()")
    # Set date to current
    @date = Date.today

    # Get date of last game played
    if (@result.last.next == nil)
      flag = doc.xpath("//table/tbody/tr[#{@result}]")
      @result_date = doc.xpath("//table/tbody/tr#{flag}/td[2]/a/text()")
    end
  end
end

Please let me know what information I've left out, because I feel I'm missing something.

To get that row, you can do this:

# Select every non-empty cell in column 8 (W/L); unplayed games have an empty cell.
# Note: no trailing .last on the selector, so win_loss_tds stays a NodeSet.
win_loss_tds = doc.css("#teams_games tbody tr td:nth-child(8):not(:empty)")
# The last such cell belongs to the most recently played game; its parent is the <tr>.
last_win_loss_row = win_loss_tds.last.parent
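There's no doubt a way to do this in a single XPath expression, but I'll leave that as an exercise for the reader, since I'm not fond of XPath.

To get the game number from the first column: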

game_num_col = last_win_loss_row.at("td:first-child")
game_num = game_num_col.text.to_i
# => 82
To get the date from the second column:

date_col = last_win_loss_row.at("td:nth-child(2)") # XPath: td[2]
date = DateTime.parse(date_col.text)
# => 2015-04-15T00:00:00+00:00
If you need both the date and the time, you can do this:

time_col = last_win_loss_row.at("td:nth-child(3)")
date_time = DateTime.parse("#{date_col.text} #{time_col.text}")
# => 2015-04-15T08:00:00-03:00
Well, I'd do it like this:

require 'open-uri'
require 'nokogiri'
# URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs.
doc = Nokogiri::HTML(URI.open("http://www.basketball-reference.com/teams/BRK/2015_games.html"))

# Find every "Box Score" link, climb two levels to its <tr>, and keep the last one.
latest_score_row = doc.search('//tr/td/a[contains(.,"Box Score")]/../..').last
latest_text = latest_score_row.search('td').map(&:text)
# => ["13",
#     "Sat, Nov 22, 2014",
#     "8:30p EST",
#     "",
#     "Box Score",
#     "@",
#     "San Antonio Spurs",
#     "L",
#     "",
#     "87",
#     "99",
#     "5",
#     "8",
#     "L 1",
#     ""]
but YMMV.


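How does it work? Easy. It looks through the page for <a> nodes containing "Box Score", then, for each one found, backs up two levels to the enclosing <tr> node and returns the matches to Ruby as an array-like NodeSet. `last` grabs the last one found.

Then it's just a matter of looking in that row for the <td> nodes and grabbing their text.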
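Once the cell texts are in a plain array, indexing by position gets hard to read. As a rough sketch, the cells could be given names with a Struct. The `Game` struct and its field names are hypothetical, and the sample array is the output shown in the answer, hard-coded here so the snippet runs without fetching the page.

```ruby
# Cell texts as shown in the answer's output, hard-coded for illustration.
latest_text = [
  "13", "Sat, Nov 22, 2014", "8:30p EST", "", "Box Score", "@",
  "San Antonio Spurs", "L", "", "87", "99", "5", "8", "L 1", ""
]

# Hypothetical container; only the columns discussed in the thread are named.
Game = Struct.new(:number, :date, :time, :opponent, :result)

game = Game.new(
  latest_text[0].to_i, # column 1: game number
  latest_text[1],      # column 2: date
  latest_text[2],      # column 3: tip-off time
  latest_text[6],      # column 7: opponent
  latest_text[7]       # column 8: "W" or "L"
)

puts "Game #{game.number} vs #{game.opponent}: #{game.result}"
```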

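After that, getting a timestamp is a matter of pulling the date and time out of the array, massaging the "am/pm" a little, and letting Ruby build an object: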

require 'time'  # Time.strptime lives in the 'time' standard library

latest_time = Time.strptime(
  [
    latest_text[1],                      # => "Sat, Nov 22, 2014"
    latest_text[2].sub(/([ap])/, '\1m')  # => "8:30pm EST"
  ].join(' '),                           # => "Sat, Nov 22, 2014 8:30pm EST"
  '%a, %b %d, %Y %I:%M%P %Z'             # %I is the 12-hour clock, paired with %P (am/pm)
)                                        # => 2014-11-22 18:30:00 -0700

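Comments:

`last_row` should refer to row x (in this case, shown in the image as row/game 13), i.e. the last row with a W or L (the flag) in column 8, not the last row of the whole table.

It isn't necessary to use `at_css` (or `css`) unless the selector is ambiguous and could be confused with XPath. The shorter `at` (or `search`) will usually do the right thing. Also, if speed matters, `parse` is a lot slower than specifying the date format and using `strptime`.

@m_antis I've updated my answer. @theTinMan Thanks for the tips. Considering how long it takes to fetch and parse each page, I suspect the `DateTime.parse` overhead is insignificant, but it's a good tip regardless.

@Jordan The only problem with your answer is the `last` method here:
win_loss_tds = doc.css("#teams_games tbody tr td:nth-child(8):not(:empty)").last
I removed it and my code ran cleanly.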
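To illustrate the tip about specifying the format instead of letting `parse` guess, here is a small sketch using only the standard library. The date string is the one from the answer's output. The format string is an assumption about how the site writes its dates, so it would need adjusting if the page's format differs.

```ruby
require 'date'

date_text = "Sat, Nov 22, 2014"

# Heuristic parsing: DateTime.parse has to work out the layout itself.
guessed = DateTime.parse(date_text)

# Explicit parsing: the layout is stated up front.
explicit = DateTime.strptime(date_text, '%a, %b %d, %Y')

puts guessed == explicit
puts explicit.strftime('%Y-%m-%d')
```

An explicit format also fails loudly with an ArgumentError when the input does not match, which can be easier to debug than a silently wrong guess.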