Parsing a web page whose links set JavaScript cookies, using Ruby, Nokogiri and Mechanize

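Hi all,

I need to parse a web page where every link sets a cookie through JavaScript. I can already parse a normal search: every product is listed and imported into a MySQL database.

I am able to extract each product and its elements from the search results. This is what I have (the full script appears further down):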
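Now, instead of searching, I want to parse from the category list:

Link to the main page: http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11

It shows a table like this (every entry is a link). The top name, ACCESORIOS, is a link to the ACCESORIOS category, the bold names listed below it are subcategories, and the names listed under each bold name are brands. If I click ACCESORIOS, it shows every brand and every subcategory mixed together, and so on.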

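ACCESORIOS
Accesorios Multimedia (6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta (1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels (1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales (1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio (32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles (60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod (3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)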
Accesorios
    require 'rubygems'
    require 'logger'
    require 'mechanize'
    require 'mysql2'
    
    # Current Mechanize releases expose the class as Mechanize, not WWW::Mechanize
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
    #agent.set_proxy('a-proxy', '8080')
    agent.read_timeout = 60
    
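    # Parse a cookie string and store each resulting cookie in the agent's jar.
    def add_cookie(agent, uri, cookie)
      uri = URI.parse(uri)
      # Block variable renamed so it no longer shadows the cookie argument
      Mechanize::Cookie.parse(uri, cookie) do |c|
        agent.cookie_jar.add(uri, c)
      end
    end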
    
    
    # get main page
    page = agent.get "http://www.site.com.mx"
    
    # get login form
    form = page.forms.first
    form.correo_ingresar = "user"
    form.password = "password"
    
    # submit login form
    page = agent.submit form
    
    # parse cookies
    myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
    
    # set session cookies
    myarray.each do |item|
      add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
    end
    # show 1000 search results per page
    add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
    
    # order results
    add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
    
    # section results
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
    
    # get main page
    page = agent.get "http://www.site.com.mx/tienda/index.php"
    
    search_form = page.forms.first
    
    search_result = agent.submit search_form
    
    doc = Nokogiri::HTML(search_result.body)
    
    rows = doc.css("table.articulos tr")
    
    i = 0
    details = rows.collect do |row|
      detail = {}
      [
        [:sku, 'td[3]/text()'],
        [:desc, 'td[4]/text()'],
        [:qty, 'td[5]/text()'],
        [:qty2, 'td[5]/p/b/text()'],
        [:price, 'td[6]/text()']
      ].collect do |name, xpath|
        detail[name] = row.at_xpath(xpath).to_s.strip
      end
      i = i + 1
      detail
    end
    
    # walk through paginator links
    # uniq rather than uniq!: uniq! returns nil when the list has no duplicates
    links = doc.css("a.paginar").map { |l| "http://www.site.com.mx#{l['href']}" }.uniq
    
    links.each do |l|
        page = agent.get l
    
        doc = Nokogiri::HTML(page.body)
    
        rows = doc.css("table.articulos tr")
    
        rows.each do |row|
            detail = {}
            [
                    [:sku, 'td[3]/text()'],
                    [:desc, 'td[4]/text()'],
                    [:qty, 'td[5]/text()'],
                    [:qty2, 'td[5]/p/b/text()'],
                    [:price, 'td[6]/text()']
            ].collect do |name, xpath|
                    detail[name] = row.at_xpath(xpath).to_s.strip
            end
            details << detail
        end
    end
    
    # update db
    client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
    
    details.each do |d|
        if d[:sku] != ""
            price = d[:price].split
    
            if price[1] == "D"
                currency = 144
            else
                currency = 168
            end
    
            cost = price[0].gsub(",", "").to_f
    
            if d[:qty] == ""
                qty = d[:qty2]
            else
                qty = d[:qty]
            end 
    
            # Escape scraped text before interpolating it into SQL
            sku  = client.escape(d[:sku])
            desc = client.escape(d[:desc])
            qty  = client.escape(qty.to_s)
    
            results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{sku}' LIMIT 1;")
            if results.count == 1
                product = results.first
    
                client.query("UPDATE jos_vm_product SET product_sku = '#{sku}', product_name = '#{desc}', product_desc = '#{desc}', product_in_stock = '#{qty}' WHERE product_id = #{product['product_id']};")
    
                client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
            else
                client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{sku}', '#{desc}', '#{desc}', '#{qty}');")
                last_id = client.last_id
    
                client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
            end
        end
    end
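
The price handling in the update loop above could be pulled into a small helper so it can be checked on its own. This is only a sketch under assumptions: `parse_price` and the constant names are hypothetical, the sample cell `"1,234.50 D"` is made up for illustration, and the currency ids 144 and 168 are simply copied from the script.

```ruby
# Currency ids copied from the script: 144 when the price is tagged "D",
# 168 otherwise. The constant names are hypothetical.
CURRENCY_D = 144
CURRENCY_DEFAULT = 168

# Splits a scraped price cell such as "1,234.50 D" into [cost, currency_id].
# An empty cell yields a cost of 0.0 and the default currency id.
def parse_price(text)
  amount, tag = text.to_s.split
  cost = amount.to_s.delete(",").to_f
  currency = tag == "D" ? CURRENCY_D : CURRENCY_DEFAULT
  [cost, currency]
end

# Hypothetical sample value, not taken from the real site
cost, currency = parse_price("1,234.50 D")
```

Inside the loop this would replace the `price`, `currency` and `cost` assignments with a single `cost, currency = parse_price(d[:price])`.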
    <table width="95%" cellspacing="0" cellpadding="3" border="0">
    <tbody>
    <tr>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
    </tr>
    <tr>
    <td width="20" valign="top" align="left"></td>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    </td>
    </tr>
    </tbody>
    </table>
    # set cookies
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")

    add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")

    add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")

    add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")
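
The four calls above differ only in cookie name and value, so the cookie strings could be generated from a hash instead. A minimal sketch: `filter_cookie` and `filters` are hypothetical names, and the string layout is copied from the `add_cookie` calls in the script.

```ruby
# Builds a cookie string in the same layout the script passes to add_cookie.
# The default domain is the placeholder host used throughout the post.
def filter_cookie(name, value, domain = "www.site.com.mx")
  "#{name}=#{value}; path=/; domain=#{domain}"
end

# Filter cookies for the ACCESORIOS section (values taken from the calls above)
filters = {
  "codigoseccion_buscar" => "11",
  "codigolinea_buscar"   => "",
  "codigomarca_buscar"   => "",
  "textobuscar"          => ""
}

cookie_strings = filters.map { |name, value| filter_cookie(name, value) }
```

Each entry of `cookie_strings` could then be handed to `add_cookie(agent, 'http://www.site.com.mx', ...)` in a loop.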
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
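
One possible way to turn a link like this into cookies without running JavaScript is to read the name/value pairs straight out of its onClick text. This is a rough sketch, not a tested solution for this site: `cookies_from_onclick` and `FILTER_CALL` are hypothetical names, and it assumes `fijar_filtro` stores its two arguments as a cookie the same way `SetCookie` does (the page's JavaScript is not shown, so that part is a guess).

```ruby
# Matches SetCookie('name','value') and fijar_filtro('name','value') calls.
# Assumption: fijar_filtro stores its arguments as a cookie, like SetCookie.
FILTER_CALL = /(?:SetCookie|fijar_filtro)\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/

# Returns a Hash of cookie name => value found in one onClick attribute.
# A nil or unrelated attribute yields an empty Hash.
def cookies_from_onclick(onclick)
  onclick.to_s.scan(FILTER_CALL).to_h
end

# onClick text copied from one of the brand links in the table
onclick = "SetCookie('codigolinea_buscar','338');" \
          "SetCookie('codigoseccion_buscar','11');" \
          "fijar_filtro('codigomarca_buscar','602');"

pairs = cookies_from_onclick(onclick)
# pairs["codigomarca_buscar"] is "602"
```

Each pair could then be passed to `add_cookie` in the same `name=value; path=/; domain=...` form the script already uses, before requesting the listing page again. With Nokogiri the attribute would likely be read as `link['onclick']`, since its HTML parser lowercases attribute names.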