Ruby: how to scrape <li> and its children

Tags: ruby, web-scraping, nokogiri

I am trying to scrape the <li> tags and the content inside them. The HTML looks like:

     <div class="insurancesAccepted">
       <h4>What insurance does he accept?*</h4>
       <ul class="noBottomMargin">
          <li class="first"><span>Aetna</span></li>
          <li>
             <a title="See accepted plans" class="insurancePlanToggle arrowUp">AvMed</a>
             <ul style="display: block;" class="insurancePlanList">
                <li class="last first">Open Access</li>
             </ul>
          </li>
          <li>
             <a title="See accepted plans" class="insurancePlanToggle arrowUp">Blue Cross Blue Shield</a>
             <ul style="display: block;" class="insurancePlanList">
                <li class="last first">Blue Card PPO</li>
             </ul>
          </li>
          <li>
             <a title="See accepted plans" class="insurancePlanToggle arrowUp">Cigna</a>
             <ul style="display: block;" class="insurancePlanList">
                <li class="first">Cigna HMO</li>
                <li>Cigna PPO</li>
                <li class="last">Great West Healthcare-Cigna PPO</li>
             </ul>
          </li>
          <li class="last">
             <a title="See accepted plans" class="insurancePlanToggle arrowUp">Empire Blue Cross Blue Shield</a>
             <ul style="display: block;" class="insurancePlanList">
                <li class="last first">Empire Blue Cross Blue Shield HMO</li>
             </ul>
          </li>
       </ul>
      </div>
    

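My current attempt displays the text of all the <li> elements at once. I want to scrape "AvMed" and "Open Access" together, keeping the relation between them, so that I can insert them into a MySQL table with a reference.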

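The problem is that doc.css('.insurancesAccepted li') matches all nested list items, not only the direct children. To match only direct children, use the parent > child CSS rule. To accomplish the task, you need to combine the results of the iteration carefully: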

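    require 'nokogiri'

    # html holds the markup shown in the question
    doc = Nokogiri::HTML(html)
    # Iterate over the top-level <li> elements only
    doc.css('div.insurancesAccepted > ul > li').each do |li|
      chapter = li.css('span').text.strip  # insurer without a plan list, e.g. "Aetna"
      section = li.css('a').text.strip     # insurer that has a plan list
      subsections = li.css('ul > li').map(&:text).map(&:strip)  # nested plan names

      puts "#{chapter} ⇒ [ #{section} ⇒ [ #{subsections.join(', ')} ] ]"
      puts '=' * 40
    end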
    
Resulting in:

    # Aetna ⇒ [  ⇒ [  ] ]
    # ========================================
    #  ⇒ [ AvMed ⇒ [ Open Access ] ]
    # ========================================
    #  ⇒ [ Blue Cross Blue Shield ⇒ [ Blue Card PPO ] ]
    # ========================================
    #  ⇒ [ Cigna ⇒ [ Cigna HMO, Cigna PPO, Great West Healthcare-Cigna PPO ] ]
    # ========================================
    #  ⇒ [ Empire Blue Cross Blue Shield ⇒ [ Empire Blue Cross Blue Shield HMO ] ]
    # ========================================
    

Hi, it's throwing the following error: "syntax error, unexpected $end, expecting keyword_end p "#{chapter} ⇒ [ #{section} ⇒ [ #{subsections.join(', ')} ] ]""