Ruby: How to parse a local file with Mechanize

ruby web-scraping

I am trying to parse a local HTML file with Ruby and Mechanize, but I can't get it to work. It does work if I use a URL:

agent = Mechanize.new
#THIS WORKS
#url = 'http://www.sample.com/sample.htm'
#page = agent.get(url) #this seems to work just fine but the following below doesn't

#THIS FAILS
file = File.read('/home/user/files/sample.htm') #this is a regular html file
page = Nokogiri::HTML(file)
pp page.body #errors here

page.search('/div[@class="product_name"]').each do |node|
  text = node.text  
  puts "product name: " + text.to_s
end

The error is:

/home/user/code/myapp/app/models/program.rb:35:in `main': undefined method `body' for #<Nokogiri::HTML::Document:0x000000011552b0> (NoMethodError)

How can I get a page object so that I can search on it?

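Mechanize uses URI strings to point to what it is supposed to parse. Normally we would use an "http" or "https" scheme to point to a web server, and that is where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.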

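I have a small HTML file called "test.html" on my Desktop. Loading it through Mechanize with a "file" URI and printing the page body outputs: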

<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>

After parsing, Mechanize outputs the same file, which shows that it loaded and parsed it.

Back to your question, here is how to find the node using only Nokogiri.

Change test.html to:

<!DOCTYPE html>
<html>
<head></head>
<body>
<div class="product_name">Hello World!</div>
</body>
</html>

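Searching that document for "div.product_name" with Nokogiri returns ["Hello World!"], which shows that Nokogiri found the node and returned its text.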

This code from your example could be better:

text = node.text  
puts "product name: " + text.to_s

node.text returns a String:

doc = Nokogiri::HTML('<p>hello world!</p>')
doc.at('p').text # => "hello world!"
doc.at('p').text.class # => String

So text.to_s is redundant. Just use text.

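The error is correct: a Nokogiri document has no "body" method. Nokogiri is the next layer down beneath Mechanize, so you have to use its methods.

You are a blessing to the world. Thank you so much. Thank you for teaching me how to make my code more elegant.

I'm not sure about that, but this is exactly what we are supposed to do here. Nokogiri is a very cool tool; I used it to write a large RSS/RDF/Atom aggregator for a company and found it very easy to work with, especially compared to the other tools we used to have to use.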

Code for the Nokogiri-only example in the answer above, run against the modified test.html:

require 'nokogiri'

doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
doc.search('div.product_name').map(&:text)
# => ["Hello World!"]
