Arrays: how do I get rid of a phantom row in my array?


I'm scraping a bunch of tables with HTTParty and then parsing the response with Nokogiri. Everything works, but I'm getting a phantom row at the top:

require 'nokogiri'
require 'httparty'
require 'byebug'
def scraper
    url = "https://github.com/public-apis/public-apis"
    parsed_page = Nokogiri::HTML(HTTParty.get(url))
    # Get categories from the ul at the top
    categories = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/ul/li/a')
    # Get all tables from the page
    tables = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/table')
    rows = []
    # Acting on one first for testing before making it dynamic 
    tables[0].search('tr').each do |tr|
        cells = tr.search('td')
        link = ''
        values = []
        row = {
            'name' => '',
            'description' => '',
            'auth' => '',
            'https' => '',
            'cors' => '',
            'category' => '',
            'url' => ''
        }
        cells.css('a').each do |a|
            link += a['href']
        end
        cells.each do |cell|
            values << cell.text
        end
        values << categories[0].text
        values << link
        rows << row.keys.zip(values).to_h
    end
    puts rows
end
scraper

Where is that first row coming from?

The first row you're seeing is most likely the header row. Header rows use th cells instead of td cells, which means

cells = tr.search('td')

will be an empty collection for the header row.

In most cases the header row is placed in a thead and the data rows in a tbody. So instead of

tables[0].search('tr')

you can do

tables[0].search('tbody tr')

which selects only the rows inside the tbody tag.


Your code can be simpler and more resilient.

Consider this:

require 'nokogiri'
require 'httparty'

URL = 'https://github.com/public-apis/public-apis'
FIELDS = %w[name description auth https cors category url]

doc = Nokogiri::HTML(HTTParty.get(URL))

category = doc.at('article li a').text

rows = doc.at('article table').search('tr')[1..-1].map { |tr| 
  values = tr.search('td').map(&:text)
  link = tr.at('a')['href']
  Hash[
    FIELDS.zip(values + [category, link])
  ]
}
Which results in:

puts rows

# >> {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}
# >> {"name"=>"Cats", "description"=>"Pictures of cats from Tumblr", "auth"=>"apiKey", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://docs.thecatapi.com/"}
# >> {"name"=>"Dogs", "description"=>"Based on the Stanford Dogs Dataset", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://dog.ceo/dog-api/"}
# >> {"name"=>"HTTPCat", "description"=>"Cat for every HTTP Status", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://http.cat/"}
# >> {"name"=>"IUCN", "description"=>"IUCN Red List of Threatened Species", "auth"=>"apiKey", "https"=>"No", "cors"=>"Unknown", "category"=>"Animals", "url"=>"http://apiv3.iucnredlist.org/api/v3/docs"}
# >> {"name"=>"Movebank", "description"=>"Movement and Migration data of animals", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://github.com/movebank/movebank-api-doc"}
# >> {"name"=>"Petfinder", "description"=>"Adoption", "auth"=>"OAuth", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://www.petfinder.com/developers/v2/docs/"}
# >> {"name"=>"PlaceGOAT", "description"=>"Placeholder goat images", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://placegoat.com/"}
# >> {"name"=>"RandomCat", "description"=>"Random pictures of cats", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://aws.random.cat/meow"}
# >> {"name"=>"RandomDog", "description"=>"Random pictures of dogs", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://random.dog/woof.json"}
# >> {"name"=>"RandomFox", "description"=>"Random pictures of foxes", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://randomfox.ca/floof/"}
# >> {"name"=>"RescueGroups", "description"=>"Adoption", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://userguide.rescuegroups.org/display/APIDG/API+Developers+Guide+Home"}
# >> {"name"=>"Shibe.Online", "description"=>"Random pictures of Shibu Inu, cats or birds", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"http://shibe.online/"}
The problems in your code were:

  • Using search('...').first when at('...') does the same thing; the second is cleaner, resulting in less visual noise.

    There are other, more subtle differences between what search and at return, which are covered in the documentation. I strongly recommend reading it and experimenting with the examples, because knowing when to use which will save you headaches.

  • Relying on absolute XPath selectors: absolute selectors are very brittle. Any change to the HTML is extremely likely to break them. Instead, find useful nodes, check whether they are unique, and let the parser locate them for you.

    The CSS selector 'article li a' skips every node until the 'article' node is found, then looks inside it for the 'li' children and the 'a' after that. You can do the same thing with XPath, but it's visually noisy, and I strongly prefer to keep my code as easy to read and understand as possible.

    Similarly, at('article table') finds the first table under the 'article' node, and search('tr') then finds only the rows embedded in that table.

    Because you want to skip the table header, [1..-1] slices the NodeSet and skips the first row.

  • Making it easier to build the structure:

    rows = doc.at('article table').search('tr')[1..-1].map { |tr|

    loops over the rows once, assigning values the text of each 'td' node in the row.

  • Hashes are easy to build by using the Hash[] constructor and passing in an array of key/value pairs:

    FIELDS.zip(values + [category, link])

    takes the values from the cells and appends a second array containing the category and the link for the row.
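The zip/Hash[] construction above can be tried on its own. This sketch uses made-up field names and values purely for illustration:

```ruby
# Hypothetical data standing in for a scraped row.
fields = %w[name description category url]
values = ['Cat Facts', 'Daily cat facts']     # from the row's td cells
extras = ['Animals', 'https://example.com']   # category and link for the row

# zip pairs each field with its value; Hash[] turns the pairs into a hash.
row = Hash[fields.zip(values + extras)]
# => {"name"=>"Cat Facts", "description"=>"Daily cat facts",
#     "category"=>"Animals", "url"=>"https://example.com"}
```

On modern Rubies, fields.zip(values + extras).to_h does the same thing.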


My example code is basically the same template I use every time I scrape a page with tables. There will be slight differences, but it's a loop over the tables, extracting the cells and converting them into hashes. On a cleanly written table you can even grab the keys for the hashes automatically from the text of the cells in the table's first row.


Welp, I was sure I had already targeted tbody; this is embarrassing. Thank you very much!

thead and tbody should exist, but they're missing from most documents on the web, either because the documents were created before those tags were added or because their authors didn't care. Browsers add them automatically and display them when we inspect the source.