Ruby: Extracting text between HTML tags with Nokogiri

I have HTML like the following:

<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>

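I have a basic Nokogiri CSS node search that returns content, but I can't find an example of how to target all the text between the nth closing H2 and the next opening H2. I'm building a CSV from the output, so I would also like to read in a list of files and put the URL in as the first result.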

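This code may help you, but it still needs more information about where the tags are (ideally, the information you want to extract would sit between some enclosing tags).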

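require 'rubygems'
require 'nokogiri'
require 'pp'

html = '<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
'

doc = Nokogiri::HTML(html)

# Pretty-print every <p> element in the document
doc.xpath('//p').each do |el|
  pp el
end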

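require 'rubygems'
require 'nokogiri'

h = '<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
'

doc = Nokogiri::HTML(h)

# Specify the ranges between delimiter tags that you want to extract.
# The triple dot excludes the end point:
# 1...2 means 1 and not 2
extract_ranges = [
  2...3,
  4...5
]

# Tags that count as delimiters and are not extracted
delimiter_tags = [
  "h1",
  "h2"
]

extracted_text = []

i = 0
# Change /"html"/"body" to the correct path of the tag that contains this list
(doc/"html"/"body").children.each do |el|
  if delimiter_tags.include?(el.name)
    i += 1
  else
    # Extract only while the delimiter count falls inside one of the ranges
    extract = extract_ranges.any? { |cur_range| cur_range.include?(i) }

    if extract
      s = el.inner_text.strip
      extracted_text << s unless s.empty?
    end
  end
end

puts extracted_text

Not an XPath solution, but here is a simple (naive) implementation that assumes the start and stop elements share the same parent, and lets you specify the XPath for start and stop independently: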
HTML = "<h1>Header is here</h1>
  <h2>Header 2 is here</h2>
     <p>Extract me!</p>
     <p>Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p>Extract me three!</p>
     <p>Extract me four!</p>"

require 'nokogiri'    
class Nokogiri::XML::Node
  # Naive implementation; assumes found elements will share the same parent
  def content_between( start_xpath, stop_xpath=nil )
    node = at_xpath(start_xpath).next_element
    stop = stop_xpath && at_xpath(stop_xpath)
    [].tap do |content|
      while node && node!=stop
        content << node
        node = node.next_element
      end
    end
  end
end

html = Nokogiri::HTML(HTML)
puts html.content_between('//h2[1]','//h2[2]').map(&:content)
#=> Extract me!
#=> Extract me too!
puts html.content_between('//h2[3]').map(&:content)
#=> Extract me three!
#=> Extract me four!

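If the start and stop elements have the same parent, this is as simple as a single XPath. First I'll show it with a simplified document for clarity, then with your sample document: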
XML = "<root>
  <a/><a1/><a2/>
  <b/><b1/><b2/>
  <c/><c1/><c2/>
</root>"

require 'nokogiri'
xml = Nokogiri::XML(XML)

# Find all elements between 'a' and 'c'
p xml.xpath('//*[preceding-sibling::a][following-sibling::c]').map(&:name)
#=> ["a1", "a2", "b", "b1", "b2"]

# Find all elements between 'a' and 'b'
p xml.xpath('//*[preceding-sibling::a][following-sibling::b]').map(&:name)
#=> ["a1", "a2"]

# Find all elements after 'c'
p xml.xpath('//*[preceding-sibling::c]').map(&:name)
#=> ["c1", "c2"]

Now, here is your use case (finding by index):
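HTML = "<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p>Extract me!</p>
     <p>Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p>Extract me three!</p>
     <p>Extract me four!</p>"

require 'nokogiri'
html = Nokogiri::HTML(HTML)

# Find all elements between the first and second h2s
p html.xpath('//*[preceding-sibling::h2[1]][following-sibling::h2[2]]').map(&:content)
#=> ["Extract me!", "Extract me too!"]

# Find all elements between the third h2 and the end
p html.xpath('//*[preceding-sibling::h2[3]]').map(&:content)
#=> ["Extract me three!", "Extract me four!"]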
You can sometimes use NodeSet's & operator to get the nodes in between:
doc.xpath('//h2[1]/following-sibling::p') & doc.xpath('//h2[2]/preceding-sibling::p')

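Thanks, Dan, this would have taken me a long time to sort out myself. Do you mind if I ask: do you know how I could feed this script a list of HTML files?

@chuckfinley files = ["path_to_file.html", ...]; files.each do |cur_file| h = File.open(cur_file, "r").read ... end

Dan, I can't get it to work. I think it's my unfamiliarity with Ruby/Nokogiri formatting. I'll post a new question so that formatting can be used in the examples. Thanks again.

Please provide sample code and an example of the desired output when asking for help. It helps us help you.

Thanks @tinman and Phrogz. I had considered adding the code I had, but it didn't get to a solution and wouldn't have helped. Phrogz, looking at your example and the link, my solution is heading in a new direction; very useful, thank you. P.S. This is my first time with Nokogiri, and it's very useful to know there are many ways to peel a potato...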