Ruby Mechanize, Nokogiri and Net::HTTP

I am using Net::HTTP to make an HTTP request and get a response:

uri = URI("http://www.example.com")
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new uri.request_uri
response = http.request request # Net::HTTPResponse object
body = response.body
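
As a side note on the argument order used above, here is a minimal sketch that builds the same kind of client and request without opening a connection, so the proxy settings and request path can be inspected before anything is sent. The proxy host and port and the URL path are hypothetical values chosen for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical proxy settings, used only for illustration.
proxy_host = 'proxy.example.com'
proxy_port = 8080

uri = URI('http://www.example.com/search?q=ruby')

# Net::HTTP.new only configures the client; no socket is opened until
# #start or #request is called.
http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)

# request_uri is the path plus query string, which is what GET sends.
request = Net::HTTP::Get.new(uri.request_uri)

puts http.proxy?         # whether a proxy is configured
puts http.proxy_address
puts http.proxy_port
puts request.path
```

Net::HTTP.start takes the same positional arguments as Net::HTTP.new but opens the connection immediately.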

If I want to parse this HTML response with the Nokogiri gem, I do the following:

nokogiri_obj = Nokogiri::HTML(body)

But if I want to use the Mechanize gem, I need to do this:

agent = Mechanize.new
mechanize_obj = agent.get("http://www.example.com")

Can I fetch the HTML response with Net::HTTP and then convert it into a Mechanize object with the Mechanize gem, instead of using agent.get()?

Edit:

The reason for bypassing the agent.get() method is that I am trying to use EventMachine::Iterator to make concurrent EM-HTTP requests:

EventMachine.run do
  EM::Iterator.new(urls, 3).each do |url,iter|
    puts "giving   #{url}   to httprequest now"
    http = EM::HttpRequest.new(url).get
    http.callback { |resp|
      uri = resp.send(:URI, url)
      puts "inside callback of #{url}"
      body = resp.response
      page = agent.parse(uri, resp, body)
    }
    iter.next
  end
end

But it does not work. I get this error:

/usr/local/rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize.rb:1165:in`parse': undefined method `[]' for #<EventMachine::HttpClient:0x0000001c18eb30> (NoMethodError)
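
The trace shows `parse` calling `[]` on the object passed as the response, which is how headers are read from a Net::HTTPResponse. A minimal sketch using only the standard library illustrates that interface. The response here is built by hand rather than fetched, and the plain Object stands in for any class that does not define `[]`.

```ruby
require 'net/http'

# Build a response object by hand (no network); normally Net::HTTP returns it.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Type'] = 'text/html; charset=utf-8'

# Header lookup through #[] is case-insensitive on Net::HTTPResponse.
puts response['content-type']
puts response.code

# An object that does not define #[] raises NoMethodError on the same call.
plain = Object.new
begin
  plain['Content-Type']
rescue NoMethodError => e
  puts e.class
end
```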

Am I passing the wrong arguments to the parse method when using em-http?

It looks like Mechanize has a parse method:

# parse is an instance method, so call it on the agent
mechanize_obj = agent.parse(uri, response, body)

I'm not sure why you think using Net::HTTP would be better. Mechanize will handle redirects and cookies, and gives you ready access to Nokogiri's parsed document.

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com')

# Use Nokogiri to find the content of the <h1> tag...
puts page.at('h1').content # => "Example Domains"

Note that setting user_agent is not required to access example.com.

If you want to retrieve pages using a threaded engine, take a look at Typhoeus.
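
For comparison, threaded retrieval can also be sketched with the standard library alone. This is a plain worker pool built on Thread and Queue rather than any particular gem. The helper name `fetch_all` is hypothetical, the default fetcher assumes a simple Net::HTTP.get is enough, and the cap of three workers mirrors the iterator in the question.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetches every URL with at most `concurrency` threads.
# The block, if given, replaces the default Net::HTTP fetcher (handy for tests).
def fetch_all(urls, concurrency = 3, &fetcher)
  fetcher ||= ->(url) { Net::HTTP.get(URI(url)) }

  queue = Queue.new
  urls.each { |url| queue << url }

  results = {}
  lock = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        body = fetcher.call(url)
        # Guard the shared hash so concurrent writes do not interleave.
        lock.synchronize { results[url] = body }
      end
    end
  end

  workers.each(&:join)
  results
end

# Usage with a stub fetcher, so nothing touches the network:
pages = fetch_all(%w[http://a.example http://b.example]) { |url| "body of #{url}" }
pages.each { |url, body| puts "#{url}: #{body.length} bytes" }
```

Each body could then be handed to Nokogiri or Mechanize for parsing once the workers have finished.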

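Why do you want to do this? agent.get is much simpler. You are doing too much work. Mechanize will handle the get for you. Mechanize also uses Nokogiri internally for parsing, so you can ask for the Nokogiri-parsed document for additional lookups.

I have edited the question, thanks. In fact, I use Mechanize later in the code to get the data I need. But I would like to know whether I can combine em-http with Mechanize as described in the question.

I'd suggest using Typhoeus. See the note I added to my answer.

Thanks @Casper. The Mechanize parse method works correctly with Net::HTTP. How do I use the same approach with em-http? I think I am passing the wrong arguments to the parse method when using it with em-http.

@Gameboy, I'd post a new question for that problem. I'm not sure whether em-http's response class is compatible with the Net::HTTP response, which is what Mechanize expects. You may need to patch something or convert the response to make it compatible.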
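
One way to read "convert the response to make it compatible" is a small adapter that exposes the header lookup `[]`, `each`, and `code` that a Net::HTTPResponse offers. The sketch below is only an illustration under stated assumptions: `ResponseAdapter` is a hypothetical name, the header hash and status are supplied by the caller (with em-http they would presumably come from the client's response_header), and whether Mechanize accepts such an object in parse has not been verified here.

```ruby
# Hypothetical adapter; not part of Mechanize or em-http.
class ResponseAdapter
  include Enumerable

  # headers: Hash of header name => value, in any case or separator style
  # status:  Integer or String HTTP status code
  def initialize(headers, status)
    @headers = {}
    headers.each { |name, value| @headers[normalize(name)] = value }
    @status = status
  end

  # Case-insensitive lookup, e.g. adapter['Content-Type'].
  def [](name)
    @headers[normalize(name)]
  end

  # Yields each header as (name, value), names in lowercase-hyphen form.
  def each(&block)
    @headers.each(&block)
  end

  # Net::HTTPResponse#code is a String, so mirror that.
  def code
    @status.to_s
  end

  private

  # 'CONTENT_TYPE', 'Content-Type' and :content_type all become 'content-type'.
  def normalize(name)
    name.to_s.downcase.tr('_', '-')
  end
end

adapter = ResponseAdapter.new({ 'CONTENT_TYPE' => 'text/html' }, 200)
puts adapter['Content-Type']
puts adapter.code
```

With such an adapter, the call in the question might become agent.parse(uri, ResponseAdapter.new(resp.response_header, resp.response_header.status), resp.response), which is again an untested assumption about both libraries.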
