Ruby on Rails: scraping elements that contain specific text with Nokogiri and XPath

Tags: ruby-on-rails, ruby, ruby-on-rails-3, screen-scraping, nokogiri

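I'm still learning how to use Nokogiri, and so far I can scrape by CSS selector. There is a page I want to scrape. I would like to get all the Barclays Premier League results through the Ajax call the page makes, but that isn't possible with Nokogiri.

The link I provided has lots of results for all the different leagues, so can I grab only the ones that sit under the element with class="competition-title" whose text is Barclays Premier League?

So far I can get all the results like this: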

def get_results # Get all results
  doc = Nokogiri::HTML(open(RESULTS_URL))
  doc.css('#results-data h2').each do |h2_tag|
    # Each h2 holds the date for the table that follows it
    date = Date.parse(h2_tag.text.strip).to_date
    matches = h2_tag.xpath('following-sibling::*[1]').css('tr.report')
    matches.each do |match|
      home_team = match.css('.team-home').text.strip
      away_team = match.css('.team-away').text.strip
      score = match.css('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end

Thanks for any help.

Edit

OK, it looks like I could use some Ruby with select? But I'm not sure how to implement it. Example below:

.select{|th|th.text =~ /Barclays Premier League/}

Or, from further reading, it seems XPath could be used:

matches = h2_tag.xpath('//th[contains(text(), "Barclays Premier League")]').css('tr.report')

I have tried the XPath approach, but it is clearly wrong, because nothing gets saved.

Thanks

I prefer an approach where you drill down to exactly what you need. Looking at the source, what you need is the match details:

    <td class='match-details'>
        <p>
            <span class='team-home teams'><a href='...'>Brechin</a></span>
            <span class='score'><abbr title='Score'> 0-2 </abbr></span>
            <span class='team-away teams'><a href='...'>Alloa</a></span>
        </p>
    </td>

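To select them, td/p is sufficient, because the match details cell is the only one that contains a <p>, but you could add the class to the td if you wanted to.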

Then you get the information exactly the way you already did:

matches.each do |match|
  home_team = match.css('.team-home').text.strip
  away_team = match.css('.team-away').text.strip
  score = match.css('.score').text.strip
  ...
end

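One task remains: getting the date of each match. Looking back at the source, you can go back up to the first enclosing table and see that the first preceding h2 node holds it. You can express that in XPath: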

date = match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text

Putting it all together:

def get_results    
  doc = Nokogiri::HTML(open(RESULTS_URL))
  matches = doc.xpath('//table[.//th[contains(., "Barclays Premier League")]]//td/p')
  matches.each do |match|
    home_team = match.css('.team-home').text.strip
    away_team = match.css('.team-away').text.strip
    score = match.css('.score').text.strip
    date = Date.parse(match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text).to_date
    Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
  end
end

For fun, here is how I would transform @Mark Thomas's solution:

def get_results    
  doc = Nokogiri::HTML(open(RESULTS_URL))
  doc.search('h2.table-header').each do |h2|
    date = Date.parse(h2.text).to_date
    next unless h2.at('+ table th[2]').text['Barclays Premier League']
    h2.search('+ table tbody tr').each do |tr|
      home_team = tr.at('.team-home').text.strip
      away_team = tr.at('.team-away').text.strip
      score = tr.at('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end

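By iterating over the h2s first, you get:

Pros:

The date is pulled out of the inner loop
Simpler expressions (you might not worry too much about these, but think of the person who comes after you)

Cons:

A few extra bytes of code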

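Wow, amazing explanation, that's really helpful. I think I'll need to read it a few times to take it all in... I know this sounds silly, but how does it all fit together? Do I just replace the date and matches variables, in the same order as in my original setup?

I put it all together for you. This is all off the top of my head, since I'm not at a computer with Ruby, but I'm pretty sure it will work :)

Hi, thanks again. I appreciate you can't test it, but at the moment nothing is being saved to the model. Can you see anything that might be incorrect? I'm trying a few things myself, but no joy yet. I noticed it should be Result, but that didn't help (that part was a mistake in my own code).

Nice, but if you iterate over the dates first, you don't need to go back up for the date.

@pguardiario That's what most people would do, and it ends up producing more code and more loops. That's why I prefer the "drill down" approach, which minimizes the need for nested loops. In my view, if you're looking for 20 things, there should be 20 iterations. That means you sometimes have to go back up the tree to find something, and that's fine.

Nokogiri can't do Ajax, and it can't do HTTP either. It only processes strings and files. However, if you know the right URL, you can have OpenURI make the HTTP request and retrieve the XML or HTML, then pass it to Nokogiri for further processing.

Shouldn't it have a set of parameters? I can't see them in the console.

You can use Firebug or a similar tool to work out the URL when the browser requests it, then copy it into your source. You'll have to figure out whether any of the query parameters are dynamic and, if so, what they should be, but once you know that you should be able to fetch the data.

By grabbing the URL I can now use my original logic with a few small tweaks, although @MarkThomas's answer is a great explanation and I learned a lot :)

Nice and clean. Note that this solution parses every date, whether or not that date has any matches for the league. Also, you're doing the text matching in Ruby land rather than in Nokogiri, which will be slightly slower, but it lets you use CSS exclusively, which is nice.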
