Ruby: Convert HTML to plain text (including <br>s)

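Is it possible to convert HTML to plain text with Nokogiri? I also want the output to account for <br /> tags.

For example, given the following HTML: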

<p>ala ma kota</p> <br /> <span>i kot to idiota </span>

When I just call Nokogiri::HTML(my_html).text, the <br /> tag is dropped:
ala ma kota i kot to idiota

Nothing like this exists out of the box, but you can easily hack together something that comes close to the desired output:
require 'nokogiri'
def render_to_ascii(node)
  blocks = %w[p div address]                      # els to put newlines after
  swaps  = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" }  # content to swap out
  dup = node.dup                                  # don't munge the original

  # Get rid of superfluous whitespace in the source
  dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }

  # Swap out the swaps
  dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }

  # Slap a couple newlines after each block level element
  dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }

  # Return the modified text content
  dup.text
end

frag = Nokogiri::HTML.fragment "<p>It is the end of the world
  as         we
  know it<br>and <i>I</i> <strong>feel</strong>
  <a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"

puts render_to_ascii(frag)
#=> It is the end of the world as we know it
#=> and I feel fine.
#=> 
#=> Capische
#=> ----------------------------------------------------------------------
#=> Buddy?

Try:
Nokogiri::HTML(my_html.gsub('<br />', "\n")).text
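The literal gsub above only catches the exact string '<br />'. A minimal sketch of a more tolerant variant follows, assuming you also want to catch <br>, <br/>, uppercase tags and tags with attributes. The names BR_TAG and normalize_breaks are hypothetical and not part of Nokogiri or the original answer.

```ruby
# Hypothetical helper: turn every <br> variant into a newline
# before the markup is handed to a parser.
# \b keeps the pattern from matching other tags that merely start with "br".
BR_TAG = /<br\b[^>]*>/i

def normalize_breaks(html)
  html.gsub(BR_TAG, "\n")
end

html = "<p>ala ma kota</p> <br /> <span>i kot to idiota </span>"
puts normalize_breaks(html)
```

The result can then be passed to Nokogiri::HTML(...).text in place of the plain gsub call.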
Nokogiri will strip links, so I first use this to preserve the links in the text version:
html_version.gsub!(/<a href.*(http:[^"']+).*>(.*)<\/a>/i) { "#{$2}\n#{$1}" }

That turns this:
<a href = "http://google.com">link to google</a>

into this:
link to google
http://google.com

Instead of writing a complex regexp, I used Nokogiri.
Working solution (K.I.S.S!):
require 'nokogiri'

def strip_html(str)
  document = Nokogiri::HTML.parse(str)
  # Replace each <br> with a newline so line breaks survive in the text
  document.css("br").each { |node| node.replace("\n") }
  document.text
end

If you use HAML, you can output the HTML with the "raw" option, e.g.:
      = raw @product.short_description