Ruby: why doesn't my web crawling method find all the links?
I'm trying to create a simple web crawler, so I wrote the following. The method get_links fetches the parent link from which we will search:
require 'nokogiri'
require 'open-uri'

def get_links(link)
  link = "http://#{link}"
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map {|link| link.attribute('href').to_s}.uniq.delete_if {|href| href.empty?}
  array = hrefs.select {|i| i[0] == "/"}
  host = URI.parse(link).host
  links_list = array.map {|a| "#{host}#{a}"}
end
The method search_links takes an array from the get_links method and searches within that array:
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
end
This method finds most of the links on a website, but not all of them. What am I doing wrong? Which algorithm should I use?

Some comments about your code:
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map {|link| link.attribute('href').to_s}.uniq.delete_if {|href| href.empty?}
  # You can write these two lines more compactly as
  # hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
  array = hrefs.select {|i| i[0] == "/"}
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.
  host = URI.parse(link).host
  links_list = array.map {|a| "#{host}#{a}"}
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
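The comments above distinguish three URL shapes that a string-prefix check can't handle together: host-relative, protocol-relative, and absolute. The standard library's URI#merge resolves all three against a base URL, so nothing is dropped. A minimal sketch (the example URLs are made up for illustration):

```ruby
require 'uri'

base = URI.parse('http://example.com/section/page.html')

# URI#merge resolves any href form against the base URL, so
# host-relative, protocol-relative, absolute, and path-relative
# links all come out as full URLs.
hrefs = ['/about', '//cdn.example.com/app.js', 'http://other.com/x', 'sub.html']
resolved = hrefs.map { |href| base.merge(href).to_s }
# resolved == ["http://example.com/about",
#              "http://cdn.example.com/app.js",
#              "http://other.com/x",
#              "http://example.com/section/sub.html"]
```

This is the same resolution mechanism the mechanize suggestion further down relies on.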
Explanation:

hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)

.xpath('//a/@href') uses XPath's attribute syntax to fetch the href attributes of the a elements directly.
.map(&:to_s) is shorthand notation for .map { |item| item.to_s }.
.delete_if(&:empty?) uses the same shorthand notation.
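The shorthand works because &:sym converts the symbol into a block via Symbol#to_proc; the long and short forms are interchangeable, as a quick plain-Ruby check shows:

```ruby
# &:to_s is equivalent to the explicit block { |item| item.to_s },
# and &:empty? to { |s| s.empty? }.
values = [1, :two, 'three', '']

long  = values.map { |item| item.to_s }.delete_if { |s| s.empty? }
short = values.map(&:to_s).delete_if(&:empty?)

long == short  # => true; both yield ["1", "two", "three"]
```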
And some comments on the second function:
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive; it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second level combined. (Without the "zero'th"
  # level - the URL passed into `search_links` - unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
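The Set suggestion in the comments could look like this. A sketch only, not the poster's code: the LINKS hash is a made-up fixture standing in for get_links and the HTTP fetching, and the crawl is a plain breadth-first traversal:

```ruby
require 'set'

# Fixture simulating link discovery: URL => links found on that page.
LINKS = {
  'a.com'   => ['a.com/x', 'b.com'],
  'a.com/x' => ['a.com'],
  'b.com'   => ['a.com', 'b.com/y'],
  'b.com/y' => []
}

def search_links(start)
  seen  = Set.new
  queue = [start]
  until queue.empty?
    url = queue.shift
    next unless seen.add?(url)  # add? returns nil if url was already seen
    queue.concat(LINKS.fetch(url, []))
  end
  seen
end

search_links('a.com').to_a
# => ["a.com", "a.com/x", "b.com", "b.com/y"]
```

Because the Set deduplicates on insertion, there is no need for the uniq / & / flatten! juggling, and every reachable page is visited exactly once, to any depth.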
You should probably use mechanize:
require 'mechanize'
agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map{|a| page.uri.merge(a[:href]).to_s}
# if you want to remove links with a different host (hyperlinks?)
links.reject!{|l| URI.parse(l).host != page.uri.host}
Otherwise you won't convert relative URLs to absolute URLs correctly. Please provide a short HTML document containing links that your code fails to find.

Comments:

This seems like good feedback, but it's hard to tell whether the question has an actual answer here. Could you please make the answer clearer?

@skrrgwsme, I can only give a clearer answer if the asker provides some data (an HTML document) demonstrating how their current code falls short of their expectations. Otherwise I can only guess which links it might miss. If you think my answer can be improved, feel free to edit it or provide your own answer.

I only mentioned it because your answer showed up in the low-quality posts review queue, so someone flagged it; I suspect that's because it's hard to tell whether it really answers the question. If you don't make the answer clearer, it may get deleted.

The block form of open isn't necessary with Nokogiri, nor is it idiomatic. The file will be closed after it's been read, during garbage collection.

I ran a test with a simple script that opens a file, passes the open file handle to Nokogiri, then runs lsof to look for open file handles in the OS, stepping through the handoff to Nokogiri; after Nokogiri returned control, the file was still open. Open file handles are usually closed when Ruby exits at the end of the script, not when the garbage collector runs. Since most Nokogiri scripts only read a single file this isn't a problem; it only becomes one when you read enough files to exhaust the OS's pool of file handles.