Selecting paragraphs in groups with XPath in Ruby

ruby xpath web-scraping

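I'm currently working on a small web scraping project using Ruby and XPath. Unfortunately, the website is very poorly structured, which leaves me with a small problem: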

<h3>Relevant Headline</h3>
<p class="class_a class_b">Content starts in this paragraph...</p>
<p class="class_a ">...but this content belongs to the preceding paragraph</p>
<p class="class_a class_b">Content starts in this paragraph...</p>
<p class="class_a ">...but this content belongs to the preceding paragraph</p>
<h3>Some other Headline</h3>

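But here is the difficulty: the first two paragraphs above belong together. The paragraph with class_b (the first one) starts a new data entry, and the next one (the second) belongs to that entry. The same goes for paragraphs 3 and 4. The problem is that sometimes three paragraphs belong together, sometimes four, but most of the time it is a pair.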

How can I select these paragraphs in groups and combine each group into a single string in Ruby?

If you don't mind using a combination of XPath and Nokogiri, you can do the following:

require 'nokogiri'

# doc is assumed to be the parsed page, e.g. Nokogiri::HTML(html)
paragraph_text = []
doc.xpath('//p[preceding-sibling::h3[1][contains(text(), "Relevant")]]').each do |p|
  if p['class'].to_s.split.include?('class_b')
    # A class_b paragraph starts a new entry
    paragraph_text << p.text
  elsif paragraph_text.any?
    # Any other paragraph continues the most recent entry
    paragraph_text[-1] += p.text
  end
end
puts paragraph_text.inspect
#=> ["Content starts in this paragraph......but this content belongs to the preceding paragraph", "Content starts in this paragraph......but this content belongs to the preceding paragraph"]

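Basically, XPath is used to get the paragraph tags under the relevant headline. Then, using Nokogiri/Ruby, you iterate over the paragraphs and build the strings.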

It can be done with XPath, but I think grouping with slice_before is easier:

doc.search('*').slice_before{|n| n.name == 'h3'}.each do |h3_group|
  h3_group.slice_before{|n| n[:class] && n[:class]['class_b']}.to_a[1..-1].each do |p_group|
    puts p_group.map(&:text) * ' '
  end
end
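
The nested slice_before grouping can be tried without Nokogiri on plain Ruby objects. The following is only a sketch: Node, group_entries, and the sample texts for the second entry are hypothetical stand-ins, not part of the original answer, and the class check uses an exact class-name match rather than the substring match above.

```ruby
# Hypothetical stand-in for a parsed element: tag name, class string, text.
Node = Struct.new(:name, :klass, :text)

nodes = [
  Node.new('h3', nil, 'Relevant Headline'),
  Node.new('p', 'class_a class_b', 'Content starts in this paragraph...'),
  Node.new('p', 'class_a ', '...but this content belongs to the preceding paragraph'),
  Node.new('p', 'class_a class_b', 'Second entry starts...'),
  Node.new('p', 'class_a ', '...continues...'),
  Node.new('p', 'class_a ', '...and ends here'),
  Node.new('h3', nil, 'Some other Headline')
]

# Split at every h3, then split each section at every class_b paragraph.
# The first sub-group of a section holds the h3 itself, so it is dropped.
def group_entries(nodes)
  nodes
    .slice_before { |n| n.name == 'h3' }
    .flat_map do |section|
      section
        .slice_before { |n| n.klass.to_s.split.include?('class_b') }
        .drop(1)
        .map { |group| group.map(&:text).join(' ') }
    end
end

entries = group_entries(nodes)
puts entries
```

Because each section is split again at every class_b paragraph, an entry can span two, three, or more paragraphs without any special handling.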

Update

Another option using CSS:

doc.search('p.class_b').each do |p|
  str, next_node = p.text, p
  while next_node = next_node.at('+ p:not([class*=class_b])')
    str += " #{next_node.text}"
  end
  puts str
end

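What gems are you using for this project? Does the solution have to be pure XPath?

I switched to XPath because I found an XPath solution for selecting the paragraphs between two headlines. I prefer using Nokogiri and its CSS methods, but if my problem requires XPath, I'll use it (even though it is hard to understand, at least for me).

Hey Justin, thanks for your answer, it helped me a lot. I don't understand what paragraph_text[-1] means. What is the index [-1] in an array?

In an array, [-1] gets the last element. It works the same way as fetching an element by index normally; the negative value simply means counting backwards from the end. In the context of the code here, it means the text is appended to the last string, the one started by a class_b paragraph.

Hi, thanks for your answer as well. This is more readable than Justin's approach. I didn't know about slice_before. One more question: what does to_a[1..-1] do?

Because slice_before returns an Enumerator, to_a turns it into an array so that we can select a range. [1..-1] means skip the first element, which is the h3.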
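
The two idioms discussed in these comments can be checked in isolation. This is a small sketch with made-up strings; the variable names are illustrative only.

```ruby
# Negative indices count from the end: [-1] is the last element.
paragraph_text = ['first entry']
paragraph_text << 'second entry'
paragraph_text[-1] += ' (continued)' # appends to the last string only

# slice_before returns an Enumerator, which has no [] method,
# so to_a is needed before a range can be taken.
sliced = %w[h3 b p b p p].slice_before { |tag| tag == 'b' }
without_heading = sliced.to_a[1..-1] # skip the first group (the h3)

puts paragraph_text.inspect
puts without_heading.inspect
```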
