Scraping JavaScript cookie links in a web page with Ruby, Nokogiri and Mechanize
Hi everyone,

I need to parse a web page in which every link sets cookies through JavaScript. I can already parse a normal search: every product is shown and imported into a MySQL database. I am able to pull each product and its fields from the search results with the Ruby script listed further down (the listing that starts with require 'rubygems').
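Now I don't want to search; I want to parse from the category list instead. Link to the main page: http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11

This shows a table like the one below (everything in it is a link). The top name, ACCESORIOS, is a link to the ACCESORIOS category. The bold names listed under it are subcategories, and the names listed under each bold name are brands. If I click ACCESORIOS, it shows every brand and every subcategory mixed together, and so on.

ACCESORIOS
Accesorios Multimedia (6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta (1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels (1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales (1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio (32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles (60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod (3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)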
Accesorios
require 'rubygems'
require 'logger'
require 'mechanize'
require 'mysql2'

# Mechanize 1.0 and later dropped the WWW:: namespace
agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
#agent.set_proxy('a-proxy', '8080')
agent.read_timeout = 60

# parse a cookie string and store it in the agent's cookie jar
def add_cookie(agent, uri, cookie)
  uri = URI.parse(uri)
  # block variable renamed so it no longer shadows the method argument
  Mechanize::Cookie.parse(uri, cookie) do |c|
    agent.cookie_jar.add(uri, c)
  end
end

# XPath of each product field, relative to a table row
FIELDS = [
  [:sku,   'td[3]/text()'],
  [:desc,  'td[4]/text()'],
  [:qty,   'td[5]/text()'],
  [:qty2,  'td[5]/p/b/text()'],
  [:price, 'td[6]/text()']
]

# build a hash of field values from one table row
def extract_detail(row)
  detail = {}
  FIELDS.each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

# get main page
page = agent.get "http://www.site.com.mx"
# get login form
form = page.forms.first
form.correo_ingresar = "user"
form.password = "password"
# submit login form
page = agent.submit form
# parse cookies (non-greedy, so several SetCookie calls on one line stay separate)
myarray = page.body.scan(/SetCookie\(\"(.+?)\", \"(.+?)\"\)/)
# set session cookies
myarray.each do |item|
  add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
end
# show 1000 search results per page
add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
# order results
add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
# section results (no space before the parenthesis, which Ruby rejects for multiple arguments)
add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
# get main page
page = agent.get "http://www.site.com.mx/tienda/index.php"
search_form = page.forms.first
search_result = agent.submit search_form
doc = Nokogiri::HTML(search_result.body)
rows = doc.css("table.articulos tr")
details = rows.collect { |row| extract_detail(row) }
# walk through paginator links
# uniq instead of uniq!, because uniq! returns nil when nothing was removed
links = doc.css("a.paginar").map { |l| "http://www.site.com.mx#{l['href']}" }.uniq
links.each do |l|
  page = agent.get l
  doc = Nokogiri::HTML(page.body)
  doc.css("table.articulos tr").each do |row|
    details << extract_detail(row)
  end
end
# update db
client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
details.each do |d|
  next if d[:sku] == ""
  price = d[:price].split
  currency = price[1] == "D" ? 144 : 168
  cost = price[0].gsub(",", "").to_f
  qty = d[:qty] == "" ? d[:qty2] : d[:qty]
  # escape scraped strings before interpolating them into SQL
  sku = client.escape(d[:sku])
  desc = client.escape(d[:desc])
  qty = client.escape(qty)
  results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{sku}' LIMIT 1;")
  if results.count == 1
    product = results.first
    client.query("UPDATE jos_vm_product SET product_sku = '#{sku}', product_name = '#{desc}', product_desc = '#{desc}', product_in_stock = '#{qty}' WHERE product_id = #{product['product_id']};")
    client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
  else
    client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{sku}', '#{desc}', '#{desc}', '#{qty}');")
    last_id = client.last_id
    client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
  end
end
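The price handling in the database step of the script above can be pulled out and checked without a site or a database. This is only a sketch: `parse_price` is a hypothetical helper name, and the sample cell "1,234.50 D" is an assumed format, inferred from how the script splits the cell on whitespace and compares the second token with "D". The currency ids 144 and 168 are copied from the script; the post does not say what they mean.

```ruby
# Currency ids copied from the script; their meaning is not stated in the post.
CURRENCY_D = 144
CURRENCY_DEFAULT = 168

# Hypothetical helper: turns a scraped price cell such as "1,234.50 D"
# (assumed format) into [cost, currency_id]. Returns nil for an empty cell.
def parse_price(text)
  amount, marker = text.to_s.split
  return nil if amount.nil?
  cost = amount.delete(",").to_f            # drop thousands separators
  currency = marker == "D" ? CURRENCY_D : CURRENCY_DEFAULT
  [cost, currency]
end

p parse_price("1,234.50 D")
p parse_price("99.00")
```

Returning nil for an empty cell is a choice made for this sketch; the script itself only skips rows whose SKU is empty.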
<table width="95%" cellspacing="0" cellpadding="3" border="0">
<tbody>
<tr>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
</tr>
<tr>
<td width="20" valign="top" align="left"></td>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
<a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
<a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
<a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
</td>
</tr>
</tbody>
</table>
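Each link in the table above carries its filter values in the onClick attribute, so one possible starting point is to read those values out of the attribute text. The sketch below uses only a regular expression; `cookies_from_onclick` and `CALL_PATTERN` are hypothetical names, and it assumes that fijar_filtro stores its two arguments as a cookie in the same way SetCookie does, which the post does not show.

```ruby
# Matches calls such as SetCookie('name','value') or fijar_filtro('name','value').
# Assumption: fijar_filtro also saves its arguments as a cookie before the
# page reloads; its source is not included in the post.
CALL_PATTERN = /(?:SetCookie|fijar_filtro)\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/

# Returns a Hash of cookie name => value for one onClick attribute string.
def cookies_from_onclick(onclick)
  onclick.to_s.scan(CALL_PATTERN).to_h
end

# onClick text copied from the first "ACTECK DE MEXICO (5)" link in the table
onclick = "SetCookie('codigolinea_buscar','338');" \
          "SetCookie('codigoseccion_buscar','11');" \
          "fijar_filtro('codigomarca_buscar','602');"
p cookies_from_onclick(onclick)
```

With Nokogiri, the attribute text could probably be collected with something like doc.css("a.busquedas").map { |a| a["onclick"] }; the HTML parser normally lowercases attribute names, but that is worth checking against the real page.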
# set cookies
add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
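For a link like the one above, the filter values can be formatted into the same cookie strings that the add_cookie calls use. This is a sketch under assumptions: `cookie_strings` is a hypothetical helper, and it is not known from the post whether setting these cookies alone is enough, or whether fijar_filtro does more than set a cookie and reload the page.

```ruby
# Hypothetical helper: formats filter values as cookie strings in the same
# "name=value; path=/; domain=..." shape passed to add_cookie in the post.
def cookie_strings(filters, domain = "www.site.com.mx")
  filters.map { |name, value| "#{name}=#{value}; path=/; domain=#{domain}" }
end

# Section 11 comes from the add_cookie call in the post; line 542 and the
# empty brand come from the onClick of the "Patch Panels" link.
filters = {
  "codigoseccion_buscar" => "11",
  "codigolinea_buscar"   => "542",
  "codigomarca_buscar"   => ""
}
cookie_strings(filters).each { |c| puts c }
```

Each resulting string could then be handed to add_cookie(agent, 'http://www.site.com.mx', c) before requesting the product listing again.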