Ruby: How to use Nokogiri to extract data from a list of URLs and save it to CSV
I have a file named bontyurls.csv that looks like this:
http://bontrager.com/model/11383
http://bontrager.com/model/01740
http://bontrager.com/model/09595
I want my script to read that file and then write an output file, bonty_test_urls_results.csv, like this:
url,model_names
http://bontrager.com/model/11383,"Road TLR Conversion Kit"
http://bontrager.com/model/01740,"404 File Not Found"
http://bontrager.com/model/09595,"RXL Road"
Here is what I have so far:
# based on code from here: http://www.andrewsturges.com/2011/09/how-to-harvest-web-data-using-ruby-and.html
require 'nokogiri'
require 'open-uri'
require 'csv'
@urls = Array.new
@model_names = Array.new
urls = CSV.read("bontyurls.csv")
(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
  doc.xpath('//h1').each do |model_name|
    @model_name << model_name.content
  end
end
# write results to file
CSV.open("bonty_test_urls_results.csv", "wb") do |row|
  row << ["url", "model_names"]
  (0..@urls.length - 1).each do |index|
    row << [
      @urls[index],
      @model_names[index]]
  end
end
Also, I haven't figured out how to handle URLs that return a 404.
I would do it like this:
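require 'csv'
require 'nokogiri'
require 'open-uri'
# Write a header row before the first data row.
CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url model_names]
}
# 'wb' keeps the output binary-safe in case the data contains 8-bit characters.
CSV.open('bonty_test_urls_results.csv', 'wb', **CSV_OPTIONS) do |csv|
  File.foreach('bontyurls.csv') do |url|
    url.chomp!
    begin
      # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
      doc = Nokogiri.HTML(URI.open(url))
      h1 = doc.at('h1').text.strip
      # Fall back to the page title when the <h1> is empty.
      h1 = doc.at('title').text.strip.sub(/^Bontrager: /i, '') if h1.empty?
      csv << [url, h1]
    rescue OpenURI::HTTPError => e
      # Record the HTTP error message in place of a model name.
      csv << [url, e.message]
    end
  end
end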
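The rescue pattern above can be tried without touching the network. What follows is a minimal sketch rather than the answer's own code: row_for and the two stub fetcher lambdas are hypothetical names introduced here, and the stubs return the model name directly so that no HTML parsing is involved.

```ruby
require 'open-uri'

# Hypothetical helper: builds one CSV row for a URL.
# `fetcher` is any callable that returns the model name
# or raises OpenURI::HTTPError.
def row_for(url, fetcher)
  [url, fetcher.call(url)]
rescue OpenURI::HTTPError => e
  # A failed request becomes data in the row instead of aborting the run.
  [url, e.message]
end

# Stub fetchers standing in for real HTTP requests (illustration only).
ok_fetcher = ->(_url) { 'RXL Road' }
missing_fetcher = lambda do |_url|
  raise OpenURI::HTTPError.new('404 File Not Found', nil)
end

ok_row = row_for('http://bontrager.com/model/09595', ok_fetcher)
missing_row = row_for('http://bontrager.com/model/01740', missing_fetcher)
p ok_row
p missing_row
```

Keeping the rescue inside the per-URL step is what lets one bad URL produce an error row while the remaining URLs are still processed.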
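You declare @model_names but then try to push onto @model_name, which is why it is nil.
@pguardiario, it is 'wb' in the OP's question. CSV data sometimes contains 8-bit characters, so I kept that setting.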
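The first comment can be checked in isolation: an instance variable that was never assigned evaluates to nil, so calling << on it raises NoMethodError. A minimal sketch, with an illustrative sample string:

```ruby
@model_names = []

begin
  # @model_name (singular) was never assigned, so it evaluates to nil.
  @model_name << 'RXL Road'
rescue NoMethodError => e
  error_class = e.class
end

# Pushing onto the variable that was actually initialized works.
@model_names << 'RXL Road'
p error_class
p @model_names
```

Renaming the push target to @model_names removes the error, although the question's script would still need to fill @urls before its write loop produces any rows.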
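require 'open-uri'
require 'nokogiri'
# URI.open replaces Kernel#open for URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open("http://bontrager.com/model/09124"))
# Print the text of every <h1> on the page.
doc.xpath('//h1').each do |node|
  puts node.text
end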
The CSV-writing script produces this output file:
url,model_names
http://bontrager.com/model/11383,Road TLR Conversion Kit (Model #11383)
http://bontrager.com/model/01740,404 File Not Found
http://bontrager.com/model/09595,RXL Road (Model #09595)
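To see what the :write_headers and :headers options do on their own, the same options can be passed to CSV.generate, which builds the CSV in a string instead of a file. This is a minimal sketch: the variable names are illustrative, and the second row is a hypothetical value added only to show quoting.

```ruby
require 'csv'

csv_options = { write_headers: true, headers: %w[url model_names] }

# CSV.generate yields a CSV writer and returns everything written as a String.
output = CSV.generate(**csv_options) do |csv|
  csv << ['http://bontrager.com/model/01740', '404 File Not Found']
  # Hypothetical row: a field containing a comma is quoted automatically.
  csv << ['http://example.com/model/1', 'Kit, Road']
end

puts output
```

The header row appears once at the top, and only fields that need quoting (such as ones containing a comma) are wrapped in double quotes, which matches the unquoted model names in the output above.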