Ruby on Rails: scraping elements that contain text with Nokogiri XPath
I am still learning how to use Nokogiri, and so far I can scrape by CSS selector. There is a page I want to scrape. I wanted to get all the Barclays Premier League results via the Ajax call, but that is not possible with Nokogiri. The link I provided has a lot of results for all the different leagues, so can I scrape only the ones contained in
class="competition-title"
So far I can get all the results like this:
def get_results # Get me all results
  doc = Nokogiri::HTML(open(RESULTS_URL))
  doc.css('#results-data h2').each do |h2_tag|
    date = Date.parse(h2_tag.text.strip).to_date
    # First sibling after the h2, which holds the matches for that date
    matches = h2_tag.xpath('following-sibling::*[1]').css('tr.report')
    matches.each do |match|
      home_team = match.css('.team-home').text.strip
      away_team = match.css('.team-away').text.strip
      score = match.css('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end
Thanks for any help
Edit
OK, so it seems I could use some Ruby here, with select? But I am not sure how to implement it. Example below:
.select{|th|th.text =~ /Barclays Premier League/}
Or, from further reading, it has been said that I can use XPath:
matches = h2_tag.xpath('//th[contains(text(), "Barclays Premier League")]').css('tr.report')
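I have tried the XPath approach, but it is clearly wrong, because nothing gets saved.
Thanks
I prefer an approach where you drill down to exactly what you need. Looking at the source, you need the match details: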
<td class='match-details'>
<p>
<span class='team-home teams'><a href='...'>Brechin</a></span>
<span class='score'><abbr title='Score'> 0-2 </abbr></span>
<span class='team-away teams'><a href='...'>Alloa</a></span>
</p>
</td>
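The path td/p is sufficient, because the match-details cell is the only td that contains a p, but you can add the class to the td if you want to be stricter.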
Then you grab the information exactly as you already do:
matches.each do |match|
  home_team = match.css('.team-home').text.strip
  away_team = match.css('.team-away').text.strip
  score = match.css('.score').text.strip
  ...
end
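One task remains: getting the date of each match. Looking back at the source, you can walk up to the first containing table and see that the first h2 node before it holds the date. You can express this in XPath: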
date = match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text
Putting it all together:
def get_results
  # open(url) relies on require 'open-uri'; on Ruby 3.0 and later, call URI.open(RESULTS_URL) instead
  doc = Nokogiri::HTML(open(RESULTS_URL))
  # Only p nodes inside a table whose header mentions the competition
  matches = doc.xpath('//table[.//th[contains(., "Barclays Premier League")]]//td/p')
  matches.each do |match|
    home_team = match.css('.team-home').text.strip
    away_team = match.css('.team-away').text.strip
    score = match.css('.score').text.strip
    date = Date.parse(match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text).to_date
    # Result is the model name used in the question
    Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
  end
end
For fun, here is how I would transform @Mark Thomas's solution:
def get_results
  doc = Nokogiri::HTML(open(RESULTS_URL))
  doc.search('h2.table-header').each do |h2|
    date = Date.parse(h2.text).to_date
    # Skip any table whose second header cell is not the wanted competition
    next unless h2.at('+ table th[2]').text['Barclays Premier League']
    h2.search('+ table tbody tr').each do |tr|
      home_team = tr.at('.team-home').text.strip
      away_team = tr.at('.team-away').text.strip
      score = tr.at('.score').text.strip
      # Result is the model name used in the question
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end
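By iterating over the h2 nodes first, you get:
Pros:
- the date is pulled out of the inner loop
- simpler expressions (you might not worry about these too much, but think about the person who comes after you)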
Cons:
- a few extra bytes of code