Ruby: accessing the most recent table row and its data
I'm building a small app based on the result of the last game played, that is, the game data in the last row (win/loss and game number). My problem is accessing the first column of that last row (the most recently played game). How can this be done?
require 'open-uri'

class BrooklynPizzaController < ApplicationController
  def index
    # URL for dynamic content
    url = "http://www.basketball-reference.com/teams/BRK/2015_games.html"
    # Open URL using nokogiri
    doc = Nokogiri::HTML(open(url))
    # Scrape result from Web site
    @result = doc.css("#teams_games").xpath("//table/tbody/tr/td[8]/text()")
    # IN PROGRESS - Get date of last game played
    @result_date = doc.xpath('//table/tbody/tr/td[2]/a/text()') do |link|
      @result_date[link.text.strip] = link['a']
    end
    ###############################################################
    # IN PROGRESS - Get number of last game played from 1st column
    # doc.xpath('//table/tbody/tr/td[1]/text()') do |game|
    #   last_game_number =
    # end
    ################################################################
    # @result_date = doc.css("#teams_games").xpath("//table/tbody/tr/td[2]/text()")
    # Set date to current
    @date = Date.today
    # Get date of last game played
    if (@result.last.next == nil)
      flag = doc.xpath("//table/tbody/tr[#{@result}]")
      @result_date = doc.xpath("//table/tbody/tr#{flag}/td[2]/a/text()")
    end
  end
end
Please let me know if the information I've provided is lacking, as I feel I've left something out.

To get that row, you can do this:
win_loss_tds = doc.css("#teams_games tbody tr td:nth-child(8):not(:empty)")
last_win_loss_row = win_loss_tds.last.parent
No doubt there's a way to do this in a single XPath expression, but I'll leave that as an exercise for the reader, since I don't like XPath.
To get the game number from the first column:
game_num_col = last_win_loss_row.at("td:first-child")
game_num = game_num_col.text.to_i
# => 82
To get the date from the second column:
date_col = last_win_loss_row.at("td:nth-child(2)") # XPath: td[2]
date = DateTime.parse(date_col.text)
# => 2015-04-15T00:00:00+00:00
If you need both the date and the time, you can do this:
time_col = last_win_loss_row.at("td:nth-child(3)")
date_time = DateTime.parse("#{date_col.text} #{time_col.text}")
# => 2015-04-15T08:00:00-03:00
Well, I'd do it like this:
require 'open-uri'
require 'nokogiri'
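# URI.open is the open-uri entry point on current Rubies (Kernel#open no longer accepts URLs on Ruby 3+)
doc = Nokogiri::HTML(URI.open("http://www.basketball-reference.com/teams/BRK/2015_games.html"))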
latest_score_row = doc.search('//tr/td/a[contains(.,"Box Score")]/../..').last
latest_text = latest_score_row.search('td').map(&:text)
# => ["13",
# "Sat, Nov 22, 2014",
# "8:30p EST",
# "",
# "Box Score",
# "@",
# "San Antonio Spurs",
# "L",
# "",
# "87",
# "99",
# "5",
# "8",
# "L 1",
# ""]
But YMMV.
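How does it work? Easy. It looks through the page for `<a>` nodes containing "Box Score", then, for each one found, backs up two levels to the `<tr>` node and returns the matches to Nokogiri/Ruby as an array-like set. `last` grabs the last one found.
Then it's just a matter of looking inside that row for the `<td>` nodes and grabbing their text.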
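After that, the timestamp is a matter of pulling the date and time out of the array, massaging the "am/pm" a little, and letting Ruby build an object: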
require 'time' # Time.strptime lives in the 'time' stdlib extension

latest_time = Time.strptime(
  [
    latest_text[1],                      # => "Sat, Nov 22, 2014"
    latest_text[2].sub(/([ap])/, '\1m')  # => "8:30pm EST"
  ].join(' '),                           # => "Sat, Nov 22, 2014 8:30pm EST"
  '%a, %b %d, %Y %I:%M%P %Z'             # %I is the 12-hour clock, paired with %P
) # => 2014-11-22 18:30:00 -0700
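The snippet above depends on `latest_text` from the scrape. As a standalone sketch of the same conversion, the helper below takes the two strings directly. The name `parse_game_time` and the slightly stricter regular expression are my own assumptions, not part of the answer; the sample strings are the ones shown in the output above.

```ruby
require 'time'

# Hypothetical helper: turn the site's "8:30p" style into "8:30pm", then parse
# the date and time together with an explicit format.
def parse_game_time(date_text, time_text)
  # Only touch an "a" or "p" that directly follows a digit, e.g. "8:30p".
  normalized = time_text.sub(/(\d)([ap])\b/, '\1\2m')
  Time.strptime("#{date_text} #{normalized}", '%a, %b %d, %Y %I:%M%P %Z')
end

game_time = parse_game_time('Sat, Nov 22, 2014', '8:30p EST')

# Print in UTC so the result does not depend on the machine's local zone.
puts game_time.getutc.iso8601
```

Printing in UTC sidesteps the fact that `Time.strptime` may display the parsed instant in the local zone of whatever machine runs it.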
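`last_row` should refer to row x (in this example, shown in the image as row/game 13), that is, the last row with a W or L (the flag) in column 8, not the last row of the whole table.

There's no need to use `at_css` (or `css`) unless the selector is ambiguous and could be confused with XPath. The shorter `at` (or `search`) will usually do the right thing. Also, if speed matters, `parse` is much slower than specifying the date format and using `strptime`.

@m_antis I've updated my answer. @theTinMan Thanks for the tip. Considering how much time it takes to fetch and parse each page, I suspect the `DateTime.parse` overhead is insignificant, but it's a good tip regardless.

@Jordan The only problem with your answer is the `last` method here: `win_loss_tds = doc.css("#teams_games tbody tr td:nth-child(8):not(:empty)").last`. I removed it and my code ran cleanly.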