Html: Extract all links from a web page


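I need to extract all links from a web page. My current solution only extracts them from tags' href attributes, but links in an HTML document do not necessarily appear as the href attribute of a tag. I need to extract all full/relative http/https links from HTML/CSS files. Is there a reliable solution?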

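You can use Ruby's built-in URI class to do this. Take a look at the extract method.

It's not as smart as what you could write using Nokogiri, which can look inside anchors, images, scripts, onclick handlers, etc., but it's a good and fast starting point.

For example, looking at the content of this question's page: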

require 'open-uri'
require 'uri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
URI.extract(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6",
#     "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
#     "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
#     "https://stackauth.com",
#     "http://chat.stackoverflow.com",
#     "http://blog.stackexchange.com",
#     "http://schema.org/Article",
#     "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
#     "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
#     "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
#     "http://stackexchange.com/legal/privacy-policy'",
#     "http://stackexchange.com/legal/terms-of-service'",
#     "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
#     "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
#     "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
#     "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
#     "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
#     "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
#     "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
#     "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
#     "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
#     "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
#     "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
#     "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
#     "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
#     "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
#     "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
#     "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
#     "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
#     "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
#     "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
#     "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
#     "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
#     "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
#     "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
#     "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
#     "http://blog.stackexchange.com?blb=1",
#     "http://chat.stackoverflow.com",
#     "http://data.stackexchange.com",
#     "http://stackexchange.com/legal",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/about/hiring",
#     "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
#     "http://meta.stackoverflow.com",
#     "http://stackoverflow.com",
#     "http://serverfault.com",
#     "http://superuser.com",
#     "http://webapps.stackexchange.com",
#     "http://askubuntu.com",
#     "http://webmasters.stackexchange.com",
#     "http://gamedev.stackexchange.com",
#     "http://tex.stackexchange.com",
#     "http://programmers.stackexchange.com",
#     "http://unix.stackexchange.com",
#     "http://apple.stackexchange.com",
#     "http://wordpress.stackexchange.com",
#     "http://gis.stackexchange.com",
#     "http://electronics.stackexchange.com",
#     "http://android.stackexchange.com",
#     "http://security.stackexchange.com",
#     "http://dba.stackexchange.com",
#     "http://drupal.stackexchange.com",
#     "http://sharepoint.stackexchange.com",
#     "http://ux.stackexchange.com",
#     "http://mathematica.stackexchange.com",
#     "http://stackexchange.com/sites#technology",
#     "http://photo.stackexchange.com",
#     "http://scifi.stackexchange.com",
#     "http://cooking.stackexchange.com",
#     "http://diy.stackexchange.com",
#     "http://stackexchange.com/sites#lifearts",
#     "http://english.stackexchange.com",
#     "http://skeptics.stackexchange.com",
#     "http://judaism.stackexchange.com",
#     "http://travel.stackexchange.com",
#     "http://christianity.stackexchange.com",
#     "http://gaming.stackexchange.com",
#     "http://bicycles.stackexchange.com",
#     "http://rpg.stackexchange.com",
#     "http://stackexchange.com/sites#culturerecreation",
#     "http://math.stackexchange.com",
#     "http://stats.stackexchange.com",
#     "http://cstheory.stackexchange.com",
#     "http://physics.stackexchange.com",
#     "http://mathoverflow.net",
#     "http://stackexchange.com/sites#science",
#     "http://stackapps.com",
#     "http://meta.stackoverflow.com",
#     "http://area51.stackexchange.com",
#     "http://careers.stackoverflow.com",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://blog.stackoverflow.com/2009/06/attribution-required/",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif",
#     "https:",
#     "https:'==document.location.protocol,",
#     "https://ssl",
#     "http://www",
#     "https://secure",
#     "http://edge",
#     "https:",
#     "https://sb",
#     "http://b"]
URI.extract returned many other entries as well, but using grep with the simple /^https?:/ pattern filters them out.
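A minimal offline sketch of the same idea, assuming a small hypothetical HTML string in place of the downloaded page. It passes the schemes straight to URI.extract instead of using grep, and trims the stray trailing quote that shows up in entries such as the privacy-policy link above. Newer Ruby versions mark URI.extract as obsolete, so treat this as a quick starting point rather than a robust parser.

```ruby
require 'uri'

# Hypothetical sample markup standing in for a downloaded page.
html = <<~HTML
  <a href="http://example.com/page">page</a>
  <img src="https://example.org/logo.png">
  <a href='http://example.com/legal'>legal</a>
  <a href="mailto:someone@example.com">mail</a>
HTML

# Restrict matches to http/https by passing the schemes directly.
raw = URI.extract(html, %w[http https])

# A single quote is a legal URI character, so the extractor can keep it
# at the end of a match; trim trailing quotes and drop duplicates.
links = raw.map { |u| u.sub(/['"]+\z/, '') }.uniq

puts links
```

Because the scheme list is given up front, the mailto: link never reaches the result, so no separate grep pass is needed.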

A simple starting point using Nokogiri would be:

require 'open-uri'
require 'nokogiri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
urls = doc.search('a, img').map{ |tag| 
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}

urls 
# => ["//stackexchange.com/sites",
#     "http://chat.stackoverflow.com",
#     "http://blog.stackexchange.com",
#     "//stackoverflow.com",
#     "//meta.stackoverflow.com",
#     "//careers.stackoverflow.com",
#     "//stackexchange.com",
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
#     "/tour",
#     "/help",
#     "//careers.stackoverflow.com",
#     "/",
#     "/questions",
#     "/tags",
#     "/about",
#     "/users",
#     "/questions/ask",
#     "/about",
#     nil,
#     "/questions/21069348/extract-all-links-from-web-page",
#     nil,
#     nil,
#     "#",
#     "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "/q/21069348",
#     "/posts/21069348/edit",
#     "/users/2886945/ivan-denisov",
#     "/users/2886945/ivan-denisov",
#     "/users/2767755/arup-rakshit",
#     "/users/2886945/ivan-denisov",
#     nil,
#     nil,
#     "/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top",
#     "/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top",
#     "/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top",
#     nil,
#     nil,
#     nil,
#     "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
#     "/a/21069456",
#     "/posts/21069456/revisions",
#     "/users/128421/the-tin-man",
#     "/users/128421/the-tin-man",
#     nil,
#     nil,
#     nil,
#     nil,
#     "http://regex101.com/r/hN4dI0",
#     "/a/21069536",
#     "/users/1214800/r3mus",
#     "/users/1214800/r3mus",
#     nil,
#     nil,
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer",
#     "#",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/legal/terms-of-service",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "/questions/ask",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "?lastactivity",
#     "/q/21052437",
#     "/questions/21052437/are-these-two-lines-the-same-vs",
#     "/q/6700367",
#     "/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "/q/430966",
#     "/questions/430966/regex-for-links-in-html-text",
#     "/q/3703712",
#     "/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table",
#     "/q/5120171",
#     "/questions/5120171/extract-links-from-a-web-page",
#     "/q/6816138",
#     "/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser",
#     "/q/10177910",
#     "/questions/10177910/php-regular-expression-extracting-html-links",
#     "/q/10217857",
#     "/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss",
#     "/q/11300496",
#     "/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl",
#     "/q/11307491",
#     "/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j",
#     "/q/17712493",
#     "/questions/17712493/extract-links-from-bbcode-with-ruby",
#     "/q/20290869",
#     "/questions/20290869/strip-away-html-tags-from-extracted-links",
#     "//stackexchange.com/questions?tab=hot",
#     "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
#     "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
#     "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
#     "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
#     "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
#     "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
#     "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
#     "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
#     "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
#     "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
#     "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
#     "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
#     "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
#     "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
#     "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
#     "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
#     "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
#     "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
#     "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
#     "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
#     "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
#     "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
#     "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
#     "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
#     "#",
#     "/feeds/question/21069348",
#     "/about",
#     "/help",
#     "/help/badges",
#     "http://blog.stackexchange.com?blb=1",
#     "http://chat.stackoverflow.com",
#     "http://data.stackexchange.com",
#     "http://stackexchange.com/legal",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/about/hiring",
#     "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
#     nil,
#     "/contact",
#     "http://meta.stackoverflow.com",
#     "http://stackoverflow.com",
#     "http://serverfault.com",
#     "http://superuser.com",
#     "http://webapps.stackexchange.com",
#     "http://askubuntu.com",
#     "http://webmasters.stackexchange.com",
#     "http://gamedev.stackexchange.com",
#     "http://tex.stackexchange.com",
#     "http://programmers.stackexchange.com",
#     "http://unix.stackexchange.com",
#     "http://apple.stackexchange.com",
#     "http://wordpress.stackexchange.com",
#     "http://gis.stackexchange.com",
#     "http://electronics.stackexchange.com",
#     "http://android.stackexchange.com",
#     "http://security.stackexchange.com",
#     "http://dba.stackexchange.com",
#     "http://drupal.stackexchange.com",
#     "http://sharepoint.stackexchange.com",
#     "http://ux.stackexchange.com",
#     "http://mathematica.stackexchange.com",
#     "http://stackexchange.com/sites#technology",
#     "http://photo.stackexchange.com",
#     "http://scifi.stackexchange.com",
#     "http://cooking.stackexchange.com",
#     "http://diy.stackexchange.com",
#     "http://stackexchange.com/sites#lifearts",
#     "http://english.stackexchange.com",
#     "http://skeptics.stackexchange.com",
#     "http://judaism.stackexchange.com",
#     "http://travel.stackexchange.com",
#     "http://christianity.stackexchange.com",
#     "http://gaming.stackexchange.com",
#     "http://bicycles.stackexchange.com",
#     "http://rpg.stackexchange.com",
#     "http://stackexchange.com/sites#culturerecreation",
#     "http://math.stackexchange.com",
#     "http://stats.stackexchange.com",
#     "http://cstheory.stackexchange.com",
#     "http://physics.stackexchange.com",
#     "http://mathoverflow.net",
#     "http://stackexchange.com/sites#science",
#     "http://stackapps.com",
#     "http://meta.stackoverflow.com",
#     "http://area51.stackexchange.com",
#     "http://careers.stackoverflow.com",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://blog.stackoverflow.com/2009/06/attribution-required/",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
#     "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
#     "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1",
#     "/posts/21069348/ivc/8228",
#     "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"]

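That uses a case statement to apply a bit of "smarts" for knowing which attribute should be retrieved from a particular type of tag. More work would need to be done, because anchors can use onclick, and other tags can be used with JavaScript events.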
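The array returned by the Nokogiri example mixes nil values, "#" anchors, protocol-relative values such as //stackexchange.com/sites, and site-relative paths such as /tour. A minimal sketch of turning such values into absolute http/https URLs with URI.join follows. The helper name absolutize_links and the sample array are assumptions made for illustration, and the page URL is assumed to be the base for resolution.

```ruby
require 'uri'

# Hypothetical helper: resolve raw href/src values against the page URL.
def absolutize_links(base_url, raw_values)
  cleaned = raw_values
            .compact                                        # tags without href/src
            .reject { |v| v.empty? || v.start_with?('#') }  # in-page anchors

  resolved = cleaned.map do |value|
    begin
      URI.join(base_url, value).to_s
    rescue URI::Error
      nil # skip values that URI cannot parse
    end
  end

  resolved.compact.grep(/\Ahttps?:/).uniq
end

base = 'http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page'
raw  = ['//stackexchange.com/sites', 'http://chat.stackoverflow.com',
        '/tour', nil, '#', '/tour']

result = absolutize_links(base, raw)
puts result
```

Protocol-relative values inherit the scheme of the base URL, and site-relative paths inherit its scheme and host, which is why the base has to be the URL the page was actually fetched from.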

您可以使用Ruby的内置URI类来完成这项工作。看看这个方法

它不像使用Nokogiri编写的那样智能,可以查看锚、图像、脚本、点击处理程序等
,但它是一个良好且快速的起点

例如,查看此问题页面的内容:

require 'open-uri'
require 'uri'

URI.extract(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6",
#     "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
#     "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
#     "https://stackauth.com",
#     "http://chat.stackoverflow.com",
#     "http://blog.stackexchange.com",
#     "http://schema.org/Article",
#     "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
#     "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
#     "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
#     "http://stackexchange.com/legal/privacy-policy'",
#     "http://stackexchange.com/legal/terms-of-service'",
#     "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
#     "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
#     "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
#     "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
#     "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
#     "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
#     "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
#     "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
#     "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
#     "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
#     "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
#     "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
#     "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
#     "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
#     "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
#     "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
#     "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
#     "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
#     "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
#     "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
#     "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
#     "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
#     "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
#     "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
#     "http://blog.stackexchange.com?blb=1",
#     "http://chat.stackoverflow.com",
#     "http://data.stackexchange.com",
#     "http://stackexchange.com/legal",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/about/hiring",
#     "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
#     "http://meta.stackoverflow.com",
#     "http://stackoverflow.com",
#     "http://serverfault.com",
#     "http://superuser.com",
#     "http://webapps.stackexchange.com",
#     "http://askubuntu.com",
#     "http://webmasters.stackexchange.com",
#     "http://gamedev.stackexchange.com",
#     "http://tex.stackexchange.com",
#     "http://programmers.stackexchange.com",
#     "http://unix.stackexchange.com",
#     "http://apple.stackexchange.com",
#     "http://wordpress.stackexchange.com",
#     "http://gis.stackexchange.com",
#     "http://electronics.stackexchange.com",
#     "http://android.stackexchange.com",
#     "http://security.stackexchange.com",
#     "http://dba.stackexchange.com",
#     "http://drupal.stackexchange.com",
#     "http://sharepoint.stackexchange.com",
#     "http://ux.stackexchange.com",
#     "http://mathematica.stackexchange.com",
#     "http://stackexchange.com/sites#technology",
#     "http://photo.stackexchange.com",
#     "http://scifi.stackexchange.com",
#     "http://cooking.stackexchange.com",
#     "http://diy.stackexchange.com",
#     "http://stackexchange.com/sites#lifearts",
#     "http://english.stackexchange.com",
#     "http://skeptics.stackexchange.com",
#     "http://judaism.stackexchange.com",
#     "http://travel.stackexchange.com",
#     "http://christianity.stackexchange.com",
#     "http://gaming.stackexchange.com",
#     "http://bicycles.stackexchange.com",
#     "http://rpg.stackexchange.com",
#     "http://stackexchange.com/sites#culturerecreation",
#     "http://math.stackexchange.com",
#     "http://stats.stackexchange.com",
#     "http://cstheory.stackexchange.com",
#     "http://physics.stackexchange.com",
#     "http://mathoverflow.net",
#     "http://stackexchange.com/sites#science",
#     "http://stackapps.com",
#     "http://meta.stackoverflow.com",
#     "http://area51.stackexchange.com",
#     "http://careers.stackoverflow.com",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://blog.stackoverflow.com/2009/06/attribution-required/",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif",
#     "https:",
#     "https:'==document.location.protocol,",
#     "https://ssl",
#     "http://www",
#     "https://secure",
#     "http://edge",
#     "https:",
#     "https://sb",
#     "http://b"]
还有很多其他条目,但是使用
grep
可以使用简单的
/^https?:/
模式过滤掉它们

Nokogiri的一个简单起点是:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
urls = doc.search('a, img').map{ |tag| 
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}

urls 
# => ["//stackexchange.com/sites",
#     "http://chat.stackoverflow.com",
#     "http://blog.stackexchange.com",
#     "//stackoverflow.com",
#     "//meta.stackoverflow.com",
#     "//careers.stackoverflow.com",
#     "//stackexchange.com",
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
#     "/tour",
#     "/help",
#     "//careers.stackoverflow.com",
#     "/",
#     "/questions",
#     "/tags",
#     "/about",
#     "/users",
#     "/questions/ask",
#     "/about",
#     nil,
#     "/questions/21069348/extract-all-links-from-web-page",
#     nil,
#     nil,
#     "#",
#     "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "/q/21069348",
#     "/posts/21069348/edit",
#     "/users/2886945/ivan-denisov",
#     "/users/2886945/ivan-denisov",
#     "/users/2767755/arup-rakshit",
#     "/users/2886945/ivan-denisov",
#     nil,
#     nil,
#     "/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top",
#     "/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top",
#     "/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top",
#     nil,
#     nil,
#     nil,
#     "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
#     "/a/21069456",
#     "/posts/21069456/revisions",
#     "/users/128421/the-tin-man",
#     "/users/128421/the-tin-man",
#     nil,
#     nil,
#     nil,
#     nil,
#     "http://regex101.com/r/hN4dI0",
#     "/a/21069536",
#     "/users/1214800/r3mus",
#     "/users/1214800/r3mus",
#     nil,
#     nil,
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer",
#     "#",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/legal/terms-of-service",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "/questions/ask",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "?lastactivity",
#     "/q/21052437",
#     "/questions/21052437/are-these-two-lines-the-same-vs",
#     "/q/6700367",
#     "/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "/q/430966",
#     "/questions/430966/regex-for-links-in-html-text",
#     "/q/3703712",
#     "/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table",
#     "/q/5120171",
#     "/questions/5120171/extract-links-from-a-web-page",
#     "/q/6816138",
#     "/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser",
#     "/q/10177910",
#     "/questions/10177910/php-regular-expression-extracting-html-links",
#     "/q/10217857",
#     "/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss",
#     "/q/11300496",
#     "/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl",
#     "/q/11307491",
#     "/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j",
#     "/q/17712493",
#     "/questions/17712493/extract-links-from-bbcode-with-ruby",
#     "/q/20290869",
#     "/questions/20290869/strip-away-html-tags-from-extracted-links",
#     "//stackexchange.com/questions?tab=hot",
#     "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
#     "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
#     "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
#     "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
#     "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
#     "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
#     "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
#     "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
#     "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
#     "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
#     "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
#     "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
#     "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
#     "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
#     "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
#     "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
#     "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
#     "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
#     "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
#     "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
#     "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
#     "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
#     "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
#     "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
#     "#",
#     "/feeds/question/21069348",
#     "/about",
#     "/help",
#     "/help/badges",
#     "http://blog.stackexchange.com?blb=1",
#     "http://chat.stackoverflow.com",
#     "http://data.stackexchange.com",
#     "http://stackexchange.com/legal",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/about/hiring",
#     "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
#     nil,
#     "/contact",
#     "http://meta.stackoverflow.com",
#     "http://stackoverflow.com",
#     "http://serverfault.com",
#     "http://superuser.com",
#     "http://webapps.stackexchange.com",
#     "http://askubuntu.com",
#     "http://webmasters.stackexchange.com",
#     "http://gamedev.stackexchange.com",
#     "http://tex.stackexchange.com",
#     "http://programmers.stackexchange.com",
#     "http://unix.stackexchange.com",
#     "http://apple.stackexchange.com",
#     "http://wordpress.stackexchange.com",
#     "http://gis.stackexchange.com",
#     "http://electronics.stackexchange.com",
#     "http://android.stackexchange.com",
#     "http://security.stackexchange.com",
#     "http://dba.stackexchange.com",
#     "http://drupal.stackexchange.com",
#     "http://sharepoint.stackexchange.com",
#     "http://ux.stackexchange.com",
#     "http://mathematica.stackexchange.com",
#     "http://stackexchange.com/sites#technology",
#     "http://photo.stackexchange.com",
#     "http://scifi.stackexchange.com",
#     "http://cooking.stackexchange.com",
#     "http://diy.stackexchange.com",
#     "http://stackexchange.com/sites#lifearts",
#     "http://english.stackexchange.com",
#     "http://skeptics.stackexchange.com",
#     "http://judaism.stackexchange.com",
#     "http://travel.stackexchange.com",
#     "http://christianity.stackexchange.com",
#     "http://gaming.stackexchange.com",
#     "http://bicycles.stackexchange.com",
#     "http://rpg.stackexchange.com",
#     "http://stackexchange.com/sites#culturerecreation",
#     "http://math.stackexchange.com",
#     "http://stats.stackexchange.com",
#     "http://cstheory.stackexchange.com",
#     "http://physics.stackexchange.com",
#     "http://mathoverflow.net",
#     "http://stackexchange.com/sites#science",
#     "http://stackapps.com",
#     "http://meta.stackoverflow.com",
#     "http://area51.stackexchange.com",
#     "http://careers.stackoverflow.com",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://blog.stackoverflow.com/2009/06/attribution-required/",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
#     "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
#     "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1",
#     "/posts/21069348/ivc/8228",
#     "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"]

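It uses a case statement to apply a bit of "smarts" so it knows which attribute to retrieve from a particular type of tag. More work would be needed, because anchors can use onclick, and other tags might carry JavaScript events too.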
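
The raw list above still mixes nil entries, bare "#" fragments, protocol-relative values such as //stackexchange.com/sites and site-relative paths such as /tour. As a rough sketch of one way to tidy it up (the helper name absolutize and the choice of base URL are assumptions for illustration, not part of the original answer), the values could be resolved against the page URL with URI.join from the standard library:

```ruby
require 'uri'

# Hypothetical helper: turn raw href/src values into absolute http/https URLs.
def absolutize(raw_urls, base_url)
  # Tags without the attribute yield nil; bare fragments only point back at the page.
  cleaned = raw_urls.compact.map(&:strip).reject { |u| u.empty? || u.start_with?('#') }

  resolved = cleaned.map do |u|
    begin
      URI.join(base_url, u).to_s   # resolves "/path" and "//host/path" against the base
    rescue URI::Error
      nil                          # skip values URI cannot parse
    end
  end

  resolved.compact.grep(/\Ahttps?:/).uniq
end

base = 'http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page'
sample = [nil, '#', '/tour', '//stackexchange.com/sites', 'http://chat.stackoverflow.com', '/tour']
puts absolutize(sample, base)
```

Passing the urls array from the Nokogiri example instead of sample should work the same way, assuming the base URL is the address the page was fetched from.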

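You can use Ruby's built-in URI class to do this. Take a look at its extract method.

It isn't as smart as what you could write with Nokogiri, which can look in anchors, images, scripts, onclick handlers and so on, but it's a good and fast starting point.

For instance, looking at the content of this question's page: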

require 'open-uri'
require 'uri'

# Kernel#open no longer fetches URLs as of Ruby 3.0; open-uri provides URI.open instead.
URI.extract(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6",
#     "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
#     "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
#     "https://stackauth.com",
#     "http://chat.stackoverflow.com",
#     "http://blog.stackexchange.com",
#     "http://schema.org/Article",
#     "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
#     "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
#     "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
#     "http://stackexchange.com/legal/privacy-policy'",
#     "http://stackexchange.com/legal/terms-of-service'",
#     "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
#     "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
#     "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
#     "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
#     "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
#     "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
#     "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
#     "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
#     "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
#     "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
#     "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
#     "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
#     "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
#     "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
#     "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
#     "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
#     "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
#     "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
#     "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
#     "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
#     "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
#     "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
#     "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
#     "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
#     "http://blog.stackexchange.com?blb=1",
#     "http://chat.stackoverflow.com",
#     "http://data.stackexchange.com",
#     "http://stackexchange.com/legal",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/about/hiring",
#     "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
#     "http://meta.stackoverflow.com",
#     "http://stackoverflow.com",
#     "http://serverfault.com",
#     "http://superuser.com",
#     "http://webapps.stackexchange.com",
#     "http://askubuntu.com",
#     "http://webmasters.stackexchange.com",
#     "http://gamedev.stackexchange.com",
#     "http://tex.stackexchange.com",
#     "http://programmers.stackexchange.com",
#     "http://unix.stackexchange.com",
#     "http://apple.stackexchange.com",
#     "http://wordpress.stackexchange.com",
#     "http://gis.stackexchange.com",
#     "http://electronics.stackexchange.com",
#     "http://android.stackexchange.com",
#     "http://security.stackexchange.com",
#     "http://dba.stackexchange.com",
#     "http://drupal.stackexchange.com",
#     "http://sharepoint.stackexchange.com",
#     "http://ux.stackexchange.com",
#     "http://mathematica.stackexchange.com",
#     "http://stackexchange.com/sites#technology",
#     "http://photo.stackexchange.com",
#     "http://scifi.stackexchange.com",
#     "http://cooking.stackexchange.com",
#     "http://diy.stackexchange.com",
#     "http://stackexchange.com/sites#lifearts",
#     "http://english.stackexchange.com",
#     "http://skeptics.stackexchange.com",
#     "http://judaism.stackexchange.com",
#     "http://travel.stackexchange.com",
#     "http://christianity.stackexchange.com",
#     "http://gaming.stackexchange.com",
#     "http://bicycles.stackexchange.com",
#     "http://rpg.stackexchange.com",
#     "http://stackexchange.com/sites#culturerecreation",
#     "http://math.stackexchange.com",
#     "http://stats.stackexchange.com",
#     "http://cstheory.stackexchange.com",
#     "http://physics.stackexchange.com",
#     "http://mathoverflow.net",
#     "http://stackexchange.com/sites#science",
#     "http://stackapps.com",
#     "http://meta.stackoverflow.com",
#     "http://area51.stackexchange.com",
#     "http://careers.stackoverflow.com",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://blog.stackoverflow.com/2009/06/attribution-required/",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif",
#     "https:",
#     "https:'==document.location.protocol,",
#     "https://ssl",
#     "http://www",
#     "https://secure",
#     "http://edge",
#     "https:",
#     "https://sb",
#     "http://b"]
There are a lot of other entries in the raw result, but grep with a simple /^https?:/ pattern filters them out.
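
Even after the grep, the output above still contains fragments picked up from inline JavaScript ("https:", "https://ssl", "http://b") and values with a stray trailing quote. The following is only a sketch of one possible clean-up, not the answer's method: the helper names extract_http_links and plausible_host?, the trailing-punctuation trim and the "host must contain a dot" rule are all assumptions made for illustration. It calls URI::RFC2396_Parser#extract directly, which, as far as I know, is the parser URI.extract has traditionally delegated to, and passes the scheme list so no separate grep is needed.

```ruby
require 'uri'

# Hypothetical heuristic: keep only URLs whose host looks like a real domain.
def plausible_host?(url)
  host = URI.parse(url).host
  !host.nil? && host.include?('.')
rescue URI::Error
  false
end

# Hypothetical helper: extract http/https URLs from arbitrary text and drop obvious junk.
def extract_http_links(text)
  candidates = URI::RFC2396_Parser.new.extract(text, %w[http https])
  trimmed = candidates.map { |u| u.sub(/['",.;)]+\z/, '') } # drop trailing quotes/punctuation
  trimmed.select { |u| plausible_host?(u) }.uniq
end

html = %q{<a href="http://example.com/a">a</a> <img src="https://example.org/i.png"> } +
       %q{var u = 'https://ssl' + x; mailto:me@example.com}
puts extract_http_links(html)
```

The dot check is deliberately crude; it would also reject hosts such as localhost, so it may need loosening depending on the pages being scanned.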

A simple starting point with Nokogiri is:

require 'open-uri'
require 'nokogiri'

# Kernel#open no longer fetches URLs as of Ruby 3.0; open-uri provides URI.open instead.
doc = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
urls = doc.search('a, img').map{ |tag| 
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}

urls 
# => ["//stackexchange.com/sites",
#     "http://chat.stackoverflow.com",
#     "http://blog.stackexchange.com",
#     "//stackoverflow.com",
#     "//meta.stackoverflow.com",
#     "//careers.stackoverflow.com",
#     "//stackexchange.com",
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
#     "/tour",
#     "/help",
#     "//careers.stackoverflow.com",
#     "/",
#     "/questions",
#     "/tags",
#     "/about",
#     "/users",
#     "/questions/ask",
#     "/about",
#     nil,
#     "/questions/21069348/extract-all-links-from-web-page",
#     nil,
#     nil,
#     "#",
#     "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "/q/21069348",
#     "/posts/21069348/edit",
#     "/users/2886945/ivan-denisov",
#     "/users/2886945/ivan-denisov",
#     "/users/2767755/arup-rakshit",
#     "/users/2886945/ivan-denisov",
#     nil,
#     nil,
#     "/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top",
#     "/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top",
#     "/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top",
#     nil,
#     nil,
#     nil,
#     "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
#     "/a/21069456",
#     "/posts/21069456/revisions",
#     "/users/128421/the-tin-man",
#     "/users/128421/the-tin-man",
#     nil,
#     nil,
#     nil,
#     nil,
#     "http://regex101.com/r/hN4dI0",
#     "/a/21069536",
#     "/users/1214800/r3mus",
#     "/users/1214800/r3mus",
#     nil,
#     nil,
#     "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer",
#     "#",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/legal/terms-of-service",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "/questions/ask",
#     "/questions/tagged/html",
#     "/questions/tagged/ruby-on-rails",
#     "/questions/tagged/ruby",
#     "/questions/tagged/regex",
#     "/questions/tagged/hyperlink",
#     "?lastactivity",
#     "/q/21052437",
#     "/questions/21052437/are-these-two-lines-the-same-vs",
#     "/q/6700367",
#     "/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
#     "/q/430966",
#     "/questions/430966/regex-for-links-in-html-text",
#     "/q/3703712",
#     "/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table",
#     "/q/5120171",
#     "/questions/5120171/extract-links-from-a-web-page",
#     "/q/6816138",
#     "/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser",
#     "/q/10177910",
#     "/questions/10177910/php-regular-expression-extracting-html-links",
#     "/q/10217857",
#     "/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss",
#     "/q/11300496",
#     "/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl",
#     "/q/11307491",
#     "/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j",
#     "/q/17712493",
#     "/questions/17712493/extract-links-from-bbcode-with-ruby",
#     "/q/20290869",
#     "/questions/20290869/strip-away-html-tags-from-extracted-links",
#     "//stackexchange.com/questions?tab=hot",
#     "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
#     "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
#     "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
#     "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
#     "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
#     "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
#     "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
#     "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
#     "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
#     "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
#     "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
#     "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
#     "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
#     "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
#     "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
#     "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
#     "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
#     "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
#     "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
#     "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
#     "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
#     "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
#     "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
#     "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
#     "#",
#     "/feeds/question/21069348",
#     "/about",
#     "/help",
#     "/help/badges",
#     "http://blog.stackexchange.com?blb=1",
#     "http://chat.stackoverflow.com",
#     "http://data.stackexchange.com",
#     "http://stackexchange.com/legal",
#     "http://stackexchange.com/legal/privacy-policy",
#     "http://stackexchange.com/about/hiring",
#     "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
#     nil,
#     "/contact",
#     "http://meta.stackoverflow.com",
#     "http://stackoverflow.com",
#     "http://serverfault.com",
#     "http://superuser.com",
#     "http://webapps.stackexchange.com",
#     "http://askubuntu.com",
#     "http://webmasters.stackexchange.com",
#     "http://gamedev.stackexchange.com",
#     "http://tex.stackexchange.com",
#     "http://programmers.stackexchange.com",
#     "http://unix.stackexchange.com",
#     "http://apple.stackexchange.com",
#     "http://wordpress.stackexchange.com",
#     "http://gis.stackexchange.com",
#     "http://electronics.stackexchange.com",
#     "http://android.stackexchange.com",
#     "http://security.stackexchange.com",
#     "http://dba.stackexchange.com",
#     "http://drupal.stackexchange.com",
#     "http://sharepoint.stackexchange.com",
#     "http://ux.stackexchange.com",
#     "http://mathematica.stackexchange.com",
#     "http://stackexchange.com/sites#technology",
#     "http://photo.stackexchange.com",
#     "http://scifi.stackexchange.com",
#     "http://cooking.stackexchange.com",
#     "http://diy.stackexchange.com",
#     "http://stackexchange.com/sites#lifearts",
#     "http://english.stackexchange.com",
#     "http://skeptics.stackexchange.com",
#     "http://judaism.stackexchange.com",
#     "http://travel.stackexchange.com",
#     "http://christianity.stackexchange.com",
#     "http://gaming.stackexchange.com",
#     "http://bicycles.stackexchange.com",
#     "http://rpg.stackexchange.com",
#     "http://stackexchange.com/sites#culturerecreation",
#     "http://math.stackexchange.com",
#     "http://stats.stackexchange.com",
#     "http://cstheory.stackexchange.com",
#     "http://physics.stackexchange.com",
#     "http://mathoverflow.net",
#     "http://stackexchange.com/sites#science",
#     "http://stackapps.com",
#     "http://meta.stackoverflow.com",
#     "http://area51.stackexchange.com",
#     "http://careers.stackoverflow.com",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://blog.stackoverflow.com/2009/06/attribution-required/",
#     "http://creativecommons.org/licenses/by-sa/3.0/",
#     "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
#     "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
#     "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1",
#     "/posts/21069348/ivc/8228",
#     "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"]

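It uses a case statement to apply a little "smarts" and decide which attribute to read from each type of tag. More work is still needed, because anchors can use onclick, and other tags may carry JavaScript event handlers as well.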
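The extra work mentioned above might look roughly like the sketch below. It is only an illustration using the standard library: link_for and absolutize are hypothetical helper names, the tags are represented as plain hashes rather than parsed nodes (with Nokogiri the tag name and attributes would come from each node), and the onclick pattern is an assumption that only catches a quoted absolute URL or root-relative path.

```ruby
require 'uri'

# Hypothetical helper: pick the URL-bearing attribute for a tag.
# `tag` is the tag name, `attrs` a plain Hash of its attributes.
def link_for(tag, attrs)
  case tag
  when 'a', 'link'
    # Fall back to a quoted URL or path inside an onclick handler, if any.
    attrs['href'] || attrs['onclick'].to_s[/["']((?:https?:\/\/|\/)[^"']*)["']/, 1]
  when 'img', 'script', 'iframe'
    attrs['src']
  end
end

# Hypothetical helper: resolve a raw value against the page URL.
# Returns nil for missing, fragment-only, or unparseable values.
def absolutize(base, raw)
  return nil if raw.nil? || raw.empty? || raw.start_with?('#')
  URI.join(base, raw).to_s
rescue URI::Error
  nil
end

base = 'http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page'

# Made-up sample tags, standing in for parsed nodes.
tags = [
  ['a',      { 'href' => '/about' }],
  ['a',      { 'href' => '#' }],
  ['a',      { 'onclick' => "window.location='/contact'" }],
  ['img',    { 'src' => 'http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1' }],
  ['script', {}]
]

links = tags.map { |tag, attrs| absolutize(base, link_for(tag, attrs)) }.compact.uniq
puts links
```

Resolving against the page URL also cleans up the nil, "#" and relative entries that show up in the raw output, at the cost of dropping fragment-only links entirely.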

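I agree that the Tin Man's answer is definitely the best route. If you really do need a catch-all regex that grabs every URL (as precisely as possible), this should work:

\w+:\/\/[\w.-]+(?::?\d{1,5})?[-\w.\/?=&%]*

Note that it requires a protocol prefix (http://, mailto://), so it will not match a bare www.google.com.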
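As a rough illustration of how that pattern behaves in Ruby, the sketch below runs it over a made-up snippet of mixed HTML and CSS with String#scan. The sample text and the names URL_PATTERN, sample and urls are assumptions for the example only.

```ruby
# The catch-all pattern from the answer, unchanged.
URL_PATTERN = /\w+:\/\/[\w.-]+(?::?\d{1,5})?[-\w.\/?=&%]*/

# Made-up sample mixing HTML, CSS and plain text.
sample = <<~TEXT
  <a href="http://example.com/page?id=1&x=2">link</a>
  body { background: url(https://example.org:8080/img/bg.png); }
  Visit www.google.com or ftp://files.example.net/pub/ for more.
TEXT

# The pattern has no capture groups, so scan returns whole matches.
urls = sample.scan(URL_PATTERN).uniq
puts urls
```

The bare www.google.com is skipped because it has no scheme. The trailing character class also has no "#", so a fragment such as #section would be cut off the end of a match.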

Please give the HTML source and some of the output you want (just give us a hint).
Just any web page. Perhaps I'm asking too much, but I need a good starting point.
That's what URI does internally.
@theTinMan Yes ;) I put this here mainly for people who find the question and are not using Rails.