Scraping data using Ruby Mechanize


I am scraping data from a website (the URL is in the code below).

Here is the code I have tried:

    require 'mechanize'

    uri = "http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=53"
    #html, html_content = @mobj.get_data(uri)

    agent = Mechanize.new
    html_page = agent.get uri
    html_form = html_page.form
    # Select the radio button named 'search' with value '2'
    html_form.radiobuttons_with(:name => 'search', :value => '2')[0].check
    # Print the response to the submission rather than the original page
    result_page = html_form.submit
    puts result_page.body

Error:

/var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 500 => Net::HTTPInternalServerError for http://www.mca.gov.in/DCAPortalWeb/dca/ProsecutionDetailsSRAction.do -- unhandled response (Mechanize::ResponseCodeError)
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize.rb:1281:in `post_form'
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize.rb:548:in `submit'
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/form.rb:223:in `submit'
from ministry_corp_aff.rb:32:in `start'
from ministry_corp_aff.rb:52:in `<main>'

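If I manually click the third radio button and then submit the form, I get a .zip file. I am trying to extract the data from the .xls file inside that zip file.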

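The radio button has an onclick event handler that triggers the execution of some JavaScript. In addition, clicking the Submit tag also executes some JavaScript. That JavaScript may set some values that the form sends back and that the server checks.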
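
To see what the form would send without any JavaScript having run, it can help to dump the form's state and compare it with the request a real browser makes. The following is only a minimal sketch: `form_snapshot` is a hypothetical helper name, and it assumes it is passed an object that responds like a Mechanize form (`action`, `fields`, `radiobuttons`).

```ruby
# Hypothetical helper: summarise what a Mechanize form would currently submit.
# It only reads attributes, so it has no side effects on the form.
def form_snapshot(form)
  {
    action: form.action,
    # Name/value pairs of ordinary and hidden fields
    fields: form.fields.map { |f| [f.name, f.value] },
    # Radio buttons with their checked state
    radios: form.radiobuttons.map { |r| [r.name, r.value, r.checked] }
  }
end

# Usage with the question's code (assumes the mechanize gem is installed):
#   require 'mechanize'
#   page = Mechanize.new.get(uri)
#   p form_snapshot(page.form)
```

Any field that is empty here but filled in the browser's request is a candidate for a value set by the page's JavaScript.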


Mechanize cannot execute JavaScript. You need selenium-webdriver for that.
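
A rough sketch of what that could look like with the selenium-webdriver gem follows. The radio button's name and value come from the question; everything else is an assumption: the helper names are hypothetical, Firefox with geckodriver is assumed to be installed, and the link text used to locate the submit element is a guess that has to be checked against the real page.

```ruby
# Build a CSS selector for a radio button identified by name and value.
def radio_selector(name, value)
  "input[type='radio'][name='#{name}'][value='#{value}']"
end

# Hypothetical helper: drive a real browser so the page's JavaScript runs.
# Returns the HTML of whatever page is displayed after the click on submit.
def submit_with_browser(url, radio_name, radio_value, submit_link_text)
  # Assumption: the selenium-webdriver gem, Firefox and geckodriver are installed.
  require 'selenium-webdriver'
  driver = Selenium::WebDriver.for(:firefox)
  begin
    driver.get(url)
    # Clicking fires the radio button's onclick handler
    driver.find_element(css: radio_selector(radio_name, radio_value)).click
    # Assumption: the submit element is a link with this visible text
    driver.find_element(link_text: submit_link_text).click
    driver.page_source
  ensure
    driver.quit
  end
end

# Usage ('Submit' is a guessed link text, not confirmed by the question):
#   submit_with_browser(uri, 'search', '2', 'Submit')
```

Because the submission produces a file download rather than a new page, the browser profile would probably also need its download preferences configured so that the .zip is saved without a dialog.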

When we click the link, a .zip file containing the .xls file is downloaded.
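
Once the .zip has been saved to disk, the .xls file can be pulled out of it. This is a sketch only, assuming the rubyzip gem (loaded with `require 'zip'`) is installed; `xls_entry?` and `extract_xls` are hypothetical helper names, and the paths in the usage line are placeholders.

```ruby
require 'fileutils'

# True when an archive entry name ends in .xls (case-insensitive).
def xls_entry?(name)
  File.extname(name).downcase == '.xls'
end

# Hypothetical helper: copy every .xls entry of a zip archive into dest_dir.
# Returns the list of paths that were written.
def extract_xls(zip_path, dest_dir)
  require 'zip' # assumption: the rubyzip gem is installed
  FileUtils.mkdir_p(dest_dir)
  extracted = []
  Zip::File.open(zip_path) do |zip|
    zip.each do |entry|
      next unless entry.file? && xls_entry?(entry.name)
      # Use only the base name so entries cannot escape dest_dir
      target = File.join(dest_dir, File.basename(entry.name))
      File.binwrite(target, entry.get_input_stream.read)
      extracted << target
    end
  end
  extracted
end

# Usage (placeholder paths):
#   extract_xls('download.zip', 'extracted')
```

Reading the spreadsheet contents afterwards needs a separate .xls parsing library, which is outside the scope of this sketch.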