Ruby: extracting text with Nokogiri while preserving links

ruby web-scraping


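How can I extract the text from the <p> below while preserving the <a> links?

<p>
  Some <a href="http://somewhere.com">link</a> going somewhere.
  <ul>
    <li><a href="http://lowendbox.com/">Low end</a></li>
  </ul>
  Some trailing text.
</p>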

Expected output:

Some <a href="http://somewhere.com">link</a> going somewhere.
<a href="http://lowendbox.com/">Low end</a>
Some trailing text.


The only solution I can think of is to override Nokogiri's text method and recurse through the children. I'm hoping there is a simpler solution.

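You can't have a ul inside a p, so any attempt to parse this as HTML4 or HTML5 will fail: an HTML parser will typically close the <p> as soon as it reaches the <ul>. That leaves regex, which can solve this easily: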

str = <<EOF
<p>
  Some <a href="http://somewhere.com">link</a> going somewhere.
  <ul>
    <li><a href="http://lowendbox.com/">Low end</a></li>
  </ul>
  Some trailing text.
</p>
EOF
puts str.gsub(/<\/?(p|ul|li)>/,'')

#  Some <a href="http://somewhere.com">link</a> going somewhere.
#
#    <a href="http://lowendbox.com/">Low end</a>
#
#  Some trailing text.
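
The output above still carries the original indentation and blank lines, so it does not quite match the expected output in the question. A minimal sketch of one way to tidy it up, using only core Ruby: strip every tag except <a>, then trim each line and drop the empty ones. The helper name strip_tags_except_links is hypothetical (it is not part of Nokogiri or Ruby), and the regex assumes simple markup with no ">" inside attribute values.

```ruby
str = <<~HTML
  <p>
    Some <a href="http://somewhere.com">link</a> going somewhere.
    <ul>
      <li><a href="http://lowendbox.com/">Low end</a></li>
    </ul>
    Some trailing text.
  </p>
HTML

# Hypothetical helper (illustration only): remove every opening or closing
# tag whose name is not exactly "a", then trim each line and discard the
# lines that end up empty.
def strip_tags_except_links(html)
  html
    .gsub(%r{</?(?!a\b)[a-z][a-z0-9]*\b[^>]*>}i, '')
    .lines
    .map(&:strip)
    .reject(&:empty?)
    .join("\n")
end

puts strip_tags_except_links(str)
```

Unlike the fixed list (p|ul|li), the negative lookahead keeps <a> and removes any other tag, which may or may not be what you want for larger documents.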

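Welcome to Stack Overflow. See "" and Jon Skeet's "". We need to see evidence of your effort. Did you search for a solution? If so, what did you find, and why didn't it help? Did you write code? If not, why not? If so, what is the minimal code that demonstrates the problem? Without that, it looks like you didn't try and want us to solve it for you, which isn't what SO is for.

What you want to do isn't hard, but it isn't trivial either. You have to get the inner HTML of the <p> tag, then promote the inner <a> so it replaces the <ul>. I won't write the code for you, because it's covered in multiple answers on SO and in the Nokogiri tutorials, and you haven't shown any effort. Show us what you've written and we'll put more effort into helping you.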