Nokogiri: parsing and cutting content between elements

I have Googled half the web looking for help with my case.

So, what I need is:

I have an HTML structure to parse that looks like this:

<div class="foo">
  <div class='bar' dir='ltr'>
    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_1' class='dx'>
          <a href='/bar32560'>1</a>
        </span>
        Neque porro 
        <a href='/xyz' class='mr'>+</a>
        quisquam est 
        <a href='/xyz' class='mr'>+</a>
        qui. 
      </p>
    </div>
    <div id='p2' class='par'>
      <p class='sb'>
        <span id='dc_1_2' class='dx'>
          <a href='/foo12356'>2</a>
        </span>
        dolorem ipsum 
        <a href='/xyz' class='mr'>+</a>
        quia dolor sit amet, 
        <a href='/xyz' class='mr'>+</a>
        consectetur, adipisci velit.
      </p>
    </div>
    <div id='p3' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>3</a>
        </span>
        Neque porro quisquam 
        <a href='/xyz' class='mr'>+</a>
        est qui dolorem ipsum quia dolor sit 
        <a href='/xyz' class='mr'>+</a>
        amet, t.
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_4' class='dx'>
          <a href='/barefoot4135'>4</a>
        </span>
        consectetur, 
        <a href='/xyz' class='mr'>+</a>
        adipisci veli.
        <span id='dc_1_5' class='dx'>
          <a href='/barfoo05123'>5</a>
       </span>
       Neque porro 
       <a href='/xyz' class='mr'>+</a>
       quisquam est
       <a href='/xyz' class='mr'>+</a>
       qui.
     </p>
   </div>
 </div>
</div>
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.
The code I use now:

page = Nokogiri::HTML(open(url))
x = page.css('.mr').remove
x.xpath("//div[contains(@class, 'par')]").map do |node|
  body = node.text
end
My result is:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.
Saved record with: book: 1, chapter: 1, verse: 1, body:  <here is last part of last sentence in first paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 2, body:  <here is last part of last sentence in second paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body:  <here is last sentence in third paragraph. It is after last "+" in this paragraph and have no more "+" signs(href)
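So this extracts the whole text of each div with class 'par'. I need to cut the paragraph text after each span that holds a verse number, or to cut those divs before each span.

I need something like this:

SPAN.text + P.text - a.mr
I don't know how to do it.

Please help me parse this. I think I need to scrape after/before each span.

Please help, I have tried everything I could find.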


EDIT for @Duck1337:

I use the following code:

def verses
    page = Nokogiri::HTML(open(url))
    i=0
    x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end
I need this because I am parsing a large website full of text, and there is no other way to do it. My final output looks like this:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.
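But if a single verse contains more than one sentence, your code splits it at every sentence, so the result is quite different from what I need.

For example: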

    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>1</a>
        </span>
        Neque porro quisquam. Est qui dolorem
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>2</a>
        </span>
        est qui dolorem ipsum quia dolor sit. 
        <a href='/xyz' class='mr'>+</a>
        amet, t.
I hope you understand what I mean. Thank you very much for doing this. It would be great if you could modify it.


EDIT for @KARDEIZ:

Thanks for your answer! When I use your code in my method, it parses really random data:

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  #page.css(".mr").remove
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end
I saved your input as "temp.html" on my desktop.

require 'open-uri'
require 'nokogiri'

$page_html = Nokogiri::HTML.parse(open("/home/user/Desktop/temp.html"))

output = $page_html.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM")

# I found the pattern ". " in every line, so i replaced ". " with (". HAM")
# I did that by using gsub(". ", ". HAM") this means replace ". " with ". HAM"

# then i split up the string with " HAM" so it preserved the "." in each item in the array


output = ["1 Neque porro quisquam est qui.", "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.", "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.", "4 consectetur, adipisci veli.", "5 Neque porro quisquam est qui."]
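
If a verse can contain several sentences, splitting on ". " cuts it in the wrong place. One possible alternative is to split in front of each verse number instead of after each sentence. The sketch below assumes that every verse starts with its number and that no other standalone numbers occur inside the verse text; `split_verses` is a hypothetical helper name, not part of the code above.

```ruby
# Hypothetical helper: splits paragraph text in front of each verse number.
# Assumes every verse starts with its number and that no other standalone
# numbers appear inside the verse text.
def split_verses(raw_text)
  # Drop the "+" dictionary links and squeeze all whitespace to single spaces.
  cleaned = raw_text.gsub("+", " ").split.join(" ")
  # Split at whitespace that is directly followed by "<number><space>".
  cleaned.split(/\s+(?=\d+\s)/)
end

sample = "1 Neque porro quisquam. Est qui dolorem + 2 est qui dolorem ipsum quia dolor sit. + amet, t."
puts split_verses(sample)
# 1 Neque porro quisquam. Est qui dolorem
# 2 est qui dolorem ipsum quia dolor sit. amet, t.
```

The lookahead keeps the number at the start of the next verse, so a verse with two sentences stays in one piece.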
EDIT:

Try the following:

x.xpath("//div[contains(@class, 'par')]//span").map do |node|
  out = node.content.strip
  if following = node.at_xpath('following-sibling::text()')
    out << ' ' << following.content.strip
  end
  out
end
Output:

[
  "1 Neque porro quisquam est qui.",
  "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.",
  "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.",
  "4 consectetur, adipisci veli.",
  "5 Neque porro quisquam est qui."
]
It is also possible to do this with pure XPath (see), but from a coding perspective this solution is simpler.

EDIT 2

Try this:

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    body = node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    VerseSource.new(body, book_num, number, i)
  end
end
Answer for the new HTML:

require 'nokogiri'

html = <<END_OF_HTML
your new html here
END_OF_HTML

html_doc  = Nokogiri::HTML(html)
current_group_number = nil
non_ws_text = []  #non_whitespace_text for each group

html_doc.css("div.par > p").each do |p|   #p's that are direct children of <div class="par">
  p.xpath("./node()").each do |node|  #All Text and Element nodes that are direct children of p tag.
    case node
    when  Nokogiri::XML::Element
      if node.name == 'span'
        node.xpath(".//a").each do |a|  #Step through all the <a> tags inside the <span>
          md = a.text.match(/\A (\d+) \z/xm)  #Check for numbers

          if md  #Then found a number, so it's the start of the next group
            if current_group_number  #then print the results for the current group
              print "scraped_body #{current_group_number} => "
              puts "#{current_group_number} #{non_ws_text.join(' ')}"
              non_ws_text = []
            end
            current_group_number = md[1] #Record the next group number 
            break  #Only look for the first <a> tag containing a number
          end

        end
      end

    when Nokogiri::XML::Text
      text = node.text
      non_ws_text << text.strip if text !~ /\A \s+ \z/xm 
    end

  end
end

#For the last group: 
print "scraped_body #{current_group_number} => "
puts "#{current_group_number} #{non_ws_text.join(' ')}"

--output:--
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.
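
The loop above is essentially a small grouping step: start a new group at every span, keep the text nodes, and skip the "+" links. As a rough sketch, the same grouping can be written with Enumerable#slice_before on plain [name, text] pairs. The helper name `group_verses` and the pair format are assumptions made for illustration; with Nokogiri such pairs could presumably be built with something like p.children.map { |n| [n.name, n.text] }, since text nodes report the name "text".

```ruby
# Hypothetical helper: groups [node_name, text] pairs into one string per verse.
# A "span" pair starts a new verse, "text" pairs are kept, and everything else
# (for example the "a" links rendered as "+") is dropped.
def group_verses(pairs)
  pairs
    .slice_before { |name, _| name == "span" }          # new chunk at every span
    .select { |chunk| chunk.first.first == "span" }     # ignore anything before the first span
    .map do |chunk|
      chunk
        .select { |name, _| name == "span" || name == "text" }
        .map { |_, text| text.strip }
        .reject(&:empty?)                               # skip whitespace-only nodes
        .join(" ")
    end
end

# Pairs as they might come from the third paragraph of the sample HTML.
pairs = [
  ["text", "\n  "],
  ["span", "\n 3\n "], ["text", "\n Neque porro quisquam \n"], ["a", "+"],
  ["text", "\n est qui dolorem ipsum quia dolor sit \n"], ["a", "+"],
  ["text", "\n amet, t.\n"], ["a", "+"], ["text", "\n "],
  ["span", "\n 4\n "], ["text", "\n consectetur, \n"], ["a", "+"],
  ["text", "\n adipisci veli.\n"],
  ["span", "\n 5\n "], ["text", "\n Neque porro \n"], ["a", "+"],
  ["text", "\n quisquam est\n"], ["a", "+"], ["text", "\n qui.\n"]
]

puts group_verses(pairs)
```

Returning an array of strings instead of printing makes it easier to pass each verse on to something like VerseSource.new.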

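Thanks @kardeiz. Sorry, I forgot a very important point in the HTML structure. In every paragraph I have links with class .mr, shown as a "+" sign after some part of the text; each one is a link to a dictionary that explains that part. When I use your solution I only receive the first piece of the paragraph after the span. I tried this before too. It is not what I need, because it is only the first part of the paragraph: Neque porro

I edited my question to make it more accurate and complete. Please take another look. Thanks again!

Thanks for your answer and the update. I tried your code but I ran into problems. Please see my edit: KARDEIZ. I hope it is clear and readable. Thanks

@hash4di, I have updated my answer. Should body be a node or a string? In my updated answer, body will be set to the string value mentioned earlier, e.g. "1 Neque porro quisquam est qui."

Perfect!!! It is 11 PM in my time zone, so I am not in great shape, but this looks legit. Thanks for now :) I will check it tomorrow. Cheers

Thanks @Duck1337 for your answer. Sorry, I forgot a very important part of the HTML structure. In every paragraph, besides the span elements, I have an href shown as a "+" sign, which is a link to a dictionary explaining the preceding part of the text. So the pattern is more complicated, because this a_href sits at random positions. I edited my question to make it more accurate and complete.

I added another .gsub("+", " ") so it removes the a_href links.

Thanks @Duck1337. But I still have a problem. Please review the edit in my question, "EDIT for @Duck1337". Thanks a lot!

Try x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM") do |node| instead of x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|

Thanks @7stud for your answer. Sorry, I forgot a very important part of the HTML structure. In every paragraph, besides the span elements, I have an href shown as a "+" sign, which is a link to a dictionary explaining the preceding part of the text. So the pattern is more complicated, because this a_href sits at random positions. I edited my question to make it more accurate and complete. But when I use your solution I receive nothing. Is there a typo in your regexp?

@hash4di, I added a modified answer to my post.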
require 'nokogiri'

your_html =<<END_OF_HTML
<your html here>
END_OF_HTML

doc  = Nokogiri::HTML(your_html)
text_nodes = doc.xpath("//div[contains(@class, 'par')]/p/child::text()")

results = text_nodes.reject do |text_node| 
  text_node.text.match /\A \s+ \z/x  #Eliminate whitespace nodes
end

results.each_with_index do |node, i|
  puts "scraped_body#{i+1} => #{node.text.strip}"
end


--output:--
scraped_body1 => Neque porro quisquam est qui.
scraped_body2 => dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body3 => Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body4 => consectetur, adipisci veli.
scraped_body5 => Neque porro quisquam est qui.