Ruby: Disable HTML escaping in XML with Nokogiri


I'm trying to parse an XML document from the Google Directions API.

This is what I have so far:

x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
  q.xpath("html_instructions").each do |h|
    puts h.inner_html
  end
end

The output looks like this:

Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;
Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;
Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;
...

I would like the output to be:

Turn <b>right</b> onto <b>N Territorial Rd</b>

But I can't seem to get that without going back to the raw XML (or perhaps by using something else). Any ideas?

This isn't a good or DRY solution, but it works:

puts h.inner_html.gsub("&lt;b&gt;" , "").gsub("&lt;/b&gt;", "").gsub("&lt;div style=\"font-size:0.9em\"&gt;", "").gsub("&lt;/div&gt;", "")


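I don't have the GoogleDirections API installed, so I can't access the XML, but I strongly suspect the problem comes from telling Nokogiri that you're dealing with XML. As a result, it returns the HTML entity-encoded, as it should be inside XML.

You can unescape the HTML using: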

require 'cgi'

CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"
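As a small sketch of the same idea on an instruction that also contains an escaped attribute (the kind of div the gsub answer strips by hand), the following uses only the standard library. The sample string is made up for illustration. CGI.unescapeHTML is the method that CGI.unescape_html aliases.

```ruby
require 'cgi'

# Hypothetical escaped instruction with a styled <div>, similar in shape
# to what the question's output contains.
escaped = 'Head &lt;b&gt;south&lt;/b&gt;' \
          '&lt;div style=&quot;font-size:0.9em&quot;&gt;Toll road&lt;/div&gt;'

# Decodes &lt;, &gt;, &quot; and &amp; back into literal characters.
unescaped = CGI.unescapeHTML(escaped)
puts unescaped
# Head <b>south</b><div style="font-size:0.9em">Toll road</div>

# Escaping the result again restores the original string.
round_trip = CGI.escapeHTML(unescaped)
```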

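I had to think about this a bit more. It's something I've run into before, but it's one of those things that escaped me in the rush at work. The fix is simple: you're using the wrong method to retrieve the content. Instead of: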

puts h.inner_html

use:

puts h.text

I proved that using:

require 'httpclient'
require 'nokogiri'

# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new

doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
  puts html.text
end

Which outputs:

Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]

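The difference is that inner_html reads the content of the node directly, without decoding it, while text decodes it for you. text, to_str and inner_text are all aliased to content internally in Nokogiri::XML::Node, for our parsing convenience.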

Wrap your nodes in CDATA:

def wrap_in_cdata(node)
    # Using Nokogiri::XML::Node#content instead of #inner_html (which
    # escapes HTML entities) so nested nodes will not work
    node.inner_html = node.document.create_cdata(node.content)
    node
end

Nokogiri::XML::Node#inner_html escapes HTML entities, except in CDATA sections:

fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left &gt; right &gt; straight &amp; reach your destination.</span></div>


fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>

Can you try calling original_content()? That is, puts h.original_content. Got the idea from here, not sure if it helps.

That didn't work, but plain content() did. Thanks.

Please add a small sample of the XML you're working with. That will help us help you.

Here is some from Google. As I said before, h.content (and h.inner_text) solves the problem. If someone can explain why, I'd be willing to accept that as the answer.

+1, although I don't really think escaping and then unescaping is the most efficient approach.

Certainly not. I ran into this many times when decoding RSS, but that was years ago, so I had to remember what I did.