How do I parse Google Images URLs with Ruby and Nokogiri?


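I'm trying to build an array of all the image files on a Google Images results page.

I need a regular expression that extracts everything after imgurl= and stops before the next &, as in this markup: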

<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world-   christmas/images/20031chapel20031-silent-night-chapel.jpg&amp;imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&amp;usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&amp;h=400&amp;w=400&amp;sz=58&amp;hl=en&amp;start=19&amp;zoom=1&amp;tbnid=ajDcsGGs0tgE9M:&amp;tbnh=124&amp;tbnw=124&amp;ei=qagfUbXmHKfv0QHI3oG4CQ&amp;itbs=1&amp;sa=X&amp;ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel I could do this with a regex, but I can't find a way to search my parsed document with one.

str = '<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world-     christmas/images/20031chapel20031-silent-night-chapel.jpg&amp;imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&amp;usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&amp;h=400&amp;w=400&amp;sz=58&amp;hl=en&amp;start=19&amp;zoom=1&amp;tbnid=ajDcsGGs0tgE9M:&amp;tbnh=124&amp;tbnw=124&amp;ei=qagfUbXmHKfv0QHI3oG4CQ&amp;itbs=1&amp;sa=X&amp;ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world-     christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is this what you're looking for?



To get all the image URLs you want:

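require 'nokogiri'
require 'open-uri'

# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML(URI.open(url)).css('a')

# get regex match or nil on desired img
# (to_s guards against anchors that have no href attribute)
img_urls = links.map { |a| a['href'].to_s[/imgurl=(.*?)&/, 1] }

# get rid of nils (compact returns a new array, so keep the result)
img_urls = img_urls.compact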

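The regex you need is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &; otherwise the greedy .* would consume everything up to the last & in the string.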
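
To make the greedy versus non-greedy difference concrete, here is a small sketch. The href values are shortened, made-up examples (example.com is a placeholder, not from the question), and the negated character class [^&]* is an alternative I'm adding for comparison, not part of the answer above.

```ruby
# Shortened, hypothetical href for illustration only.
href = 'http://www.google.com/imgres?imgurl=http://example.com/a.jpg' \
       '&imgrefurl=http://example.com/page.html&h=400'

# Non-greedy: stops at the first & after imgurl=
lazy = href[/imgurl=(.*?)&/, 1]

# Greedy: runs on to the last & in the string
greedy = href[/imgurl=(.*)&/, 1]

# Alternative: match everything that is not an &
negated = href[/imgurl=([^&]*)/, 1]

# If imgurl happens to be the last parameter there is no trailing &,
# so the non-greedy pattern finds nothing while [^&]* still matches.
tail = 'http://www.google.com/imgres?h=400&imgurl=http://example.com/a.jpg'
lazy_tail    = tail[/imgurl=(.*?)&/, 1]
negated_tail = tail[/imgurl=([^&]*)/, 1]

puts lazy
puts greedy
puts negated
p lazy_tail
puts negated_tail
```

The [^&]* form does not depend on a following &, which is why it may be the safer pattern if you stay with a regex.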


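The problem with using a regex is that you assume too much about the order of the parameters in the URL. If the order changes, or the &amp; after the value disappears, the regex won't work.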

Instead, parse the URL, then split the values out:

# encoding: UTF-8

require 'nokogiri'
require 'cgi'
require 'uri'

doc = Nokogiri::HTML.parse('<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg&amp;imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&amp;usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&amp;h=400&amp;w=400&amp;sz=58&amp;hl=en&amp;start=19&amp;zoom=1&amp;tbnid=ajDcsGGs0tgE9M:&amp;tbnh=124&amp;tbnw=124&amp;ei=qagfUbXmHKfv0QHI3oG4CQ&amp;itbs=1&amp;sa=X&amp;ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')

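# each returns the node set itself, so there is nothing useful to assign here
doc.search('a').each do |a|
  # CGI.parse returns a hash whose values are arrays of strings
  query_params = CGI.parse(URI(a['href']).query)
  puts query_params['imgurl']
end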
URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.

I've also been known to use something like the following to decode the query string into a hash:

Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
This returns:

{"imgurl"=> "http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg", "imgrefurl"=> "http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html", "usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ", "h"=>"400", "w"=>"400", "sz"=>"58", "hl"=>"en", "start"=>"19", "zoom"=>"1", "tbnid"=>"ajDcsGGs0tgE9M:", "tbnh"=>"124", "tbnw"=>"124", "ei"=>"qagfUbXmHKfv0QHI3oG4CQ", "itbs"=>"1", "sa"=>"X", "ved"=>"0CE4QrQMwEg"}
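
Note that in the output above the trailing = of the usg value is gone, because split('=') with no limit splits on every = and drops trailing empty pieces. A minimal variant, assuming you want to keep such values intact, splits each pair on the first = only. query_to_hash is a hypothetical name, the sample href is made up, and the percent-decoding step is my addition, not part of the one-liner.

```ruby
require 'uri'

# Hypothetical helper (not from the original answer): build a Hash from a
# query string, splitting each pair on the first '=' only.
def query_to_hash(query)
  query.to_s.split('&').reject(&:empty?).map { |pair|
    key, value = pair.split('=', 2)
    # decode %XX escapes and '+' in the value; an added step, see lead-in
    [key, URI.decode_www_form_component(value.to_s)]
  }.to_h
end

# Made-up href for illustration.
href = 'http://www.google.com/imgres?imgurl=http://example.com/a.jpg&usg=__abc=&h=400'
params = query_to_hash(URI(href).query)

puts params['imgurl']
puts params['usg']
puts params['h']
```

If a key repeats, the last value wins in this sketch, unlike CGI.parse, which collects all values for a key into an array.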