Does anyone know of a caching plugin for Ruby Mechanize?

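I have a Mechanize-based Ruby script that scrapes a website. I'd like to speed it up by caching the downloaded HTML pages locally, so that the whole "tweak output -> run -> tweak output" cycle is quicker. I'd prefer not to have to install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images, and so on.

Does anyone know of a library that does this? Or another way to achieve the same result (the script running much faster the second time around)?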

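I'm not sure caching pages will help that much. It's more useful to keep a record of previously visited URLs so you don't visit them again. Caching the page itself is moot: you should have already grabbed the important information the first time you saw the page, so all you need to do is check whether you've already seen it. If you have, pull up the summary information you care about and process it as needed.

I used to write analytics spiders with Perl's Mechanize, which Ruby's Mechanize is based on. Keeping previously visited URLs in some sort of in-memory cache, such as a hash, is useful, but if the app crashes or the host goes down mid-session, all the previous results are gone. At that point a real disk-based database is essential.

I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information onto the drive so it survives a restart or a crash.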
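As a rough illustration of the "record what you've seen on disk" idea, here is a minimal sketch. It swaps in Ruby's standard-library PStore instead of Postgres or SQLite purely to stay dependency-free; the `VisitedLog` class and its method names are hypothetical, not part of Mechanize.

```ruby
require 'pstore'

# Hypothetical helper (not part of Mechanize): a disk-backed record of
# visited URLs, so a crash or restart does not lose what was already scraped.
class VisitedLog
  def initialize(path)
    @store = PStore.new(path)
  end

  # True if the URL was recorded in this run or any earlier one.
  def visited?(url)
    @store.transaction(true) { @store.root?(url) }
  end

  # Record the URL together with whatever summary data was scraped from it.
  def record(url, summary)
    @store.transaction { @store[url] = summary }
  end

  # Return the stored summary, or nil if the URL has not been seen.
  def summary(url)
    @store.transaction(true) { @store[url] }
  end
end
```

In a spider loop you would check `log.visited?(url)` before fetching and call `log.record(url, data)` right after scraping. Each write is its own transaction, so the file is up to date after every page.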

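I'd also recommend using a YAML file to configure your app. Put every parameter that is likely to change during a run in there. Then write the app so that it periodically checks the file's modification time and reloads it when it changes. That way you can adjust its run-time behavior on the fly. Some years ago I had to write a spider to analyze the multiple websites of a Fortune 50 company. The app ran for three weeks across many different sites tied to that company, and because I could tweak the regular expressions that controlled which pages it processed, I could fine-tune it without shutting it down.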
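A minimal sketch of that reload-on-change idea, using only the standard library. The `ReloadingConfig` class and the `url_pattern` key are made up for illustration. This version checks the modification time on every lookup rather than on a timer, which costs one `stat` call per access.

```ruby
require 'yaml'

# Hypothetical helper: re-reads a YAML file whenever its mtime changes.
class ReloadingConfig
  def initialize(path)
    @path = path
    @mtime = nil
    @data = {}
  end

  # Look up a setting, reloading the file first if it changed on disk.
  def [](key)
    reload_if_changed
    @data[key]
  end

  private

  def reload_if_changed
    mtime = File.mtime(@path)
    return if mtime == @mtime

    # An empty file parses to nil/false, so fall back to an empty hash.
    @data = YAML.safe_load(File.read(@path)) || {}
    @mtime = mtime
  end
end
```

With a file containing a line such as `url_pattern: 'products'`, the spider could build its filter with `Regexp.new(config['url_pattern'])` on each iteration and pick up edits without a restart.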

How about writing the pages out to files, one page per file, and separating the tweak cycle from the run cycle?
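One way this could look, as a sketch only: a hypothetical `cached_body` helper that stores each page in its own file, named after a hash of the URL, and only calls the supplied block when no file exists yet. The helper name and the directory layout are assumptions.

```ruby
require 'digest'
require 'fileutils'

# Hypothetical helper: returns the raw body for `url`. If no saved copy
# exists in `dir`, the block is called to fetch it and the result is
# written to a file of its own before being read back.
def cached_body(url, dir)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, Digest::SHA256.hexdigest(url) + '.html')
  File.binwrite(path, yield(url)) unless File.exist?(path)
  File.binread(path)
end

# Usage with Mechanize (assumes an `agent` created elsewhere):
#   html = cached_body(url, 'pages') { |u| agent.get(u).body }
```

The "run" step then only fills the directory, and the "tweak" step can re-parse the saved files as often as needed. Deleting a file forces that page to be fetched again.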

A good way to do this is to use the (awesome) VCR gem.

Here is an example of how you would do it:

require 'vcr'
require 'mechanize'

# Setup VCR's configs.  The cassette library directory is where 
# all of your "recordings" are saved as YAML files.  
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end

# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end

As you can see, VCR records the communication as a YAML file on the first run:

mario$  find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml

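If you want VCR to create a new version of the cassette, just delete the corresponding file.

If you store some information about the page after the first request, you can rebuild the page later without re-requesting it from the server.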

# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]

# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)

I use this technique in my spiders/scrapers so that I can tweak the code without re-requesting all the pages, e.g.:

# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
#    for in-memory storage, you could use a Hash.
#    or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)

  def get_cached(uri)
    cache_key = "_cache/#{uri}"

    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page

    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page

    end
  end

end

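Calling get_cached a few times for two URLs logs the following, and dumping the storage afterwards shows what was cached: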

D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
  [#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
   {"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"87",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
   "200"],
 "_cache/http://ifconfig.me/encoding"=>
  [#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
   {"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"42",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "gzip,deflate,identity\n",
   "200"]}

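The `storage` object only has to respond to `[]` and `[]=`, so a filesystem-backed version is small. The sketch below is one possible shape, not the answer author's code: `DiskStorage` is a hypothetical name, it uses `Marshal` for serialization, and it assumes every stored value can be marshaled.

```ruby
require 'digest'
require 'fileutils'

# Hypothetical storage class: one Marshal-dumped file per cache key.
# Only load files this script wrote itself; Marshal.load is unsafe on
# untrusted data.
class DiskStorage
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Return the stored object, or nil when the key has never been written.
  def [](key)
    path = path_for(key)
    File.exist?(path) ? Marshal.load(File.binread(path)) : nil
  end

  # Serialize the value to its own file, overwriting any earlier copy.
  def []=(key, value)
    File.binwrite(path_for(key), Marshal.dump(value))
  end

  private

  # Keys contain slashes and colons, so hash them into safe file names.
  def path_for(key)
    File.join(@dir, Digest::SHA256.hexdigest(key.to_s))
  end
end
```

It would be passed in place of the Hash, for example `Foobar.new(Mechanize.new, DiskStorage.new('cache'), Logger.new($stdout))`, so cached pages survive between runs.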
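I'm not sure whether this will work out of the box for what you want, since it's clearly designed as a reverse proxy rather than a proxy, but perhaps it could be reworked to fit your needs?

If you can't resolve multiple redirects to the same page, you may still end up with a lot of duplicate pages.

Separate fetch and scan scripts are a good idea, and not too hard to implement. Thanks.

Thanks. I do hash the visited pages during a run to avoid getting caught in loops. I can also see from the Mechanize source that it keeps a history too, and uses it with If-Modified-Since. I was hoping someone had extended it to put the history on disk, or in a database, or something.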
