Arrays: how do I remove a phantom row from my array?
I'm scraping a bunch of tables with HTTParty and then parsing the responses with Nokogiri. Everything works fine, but I'm seeing a phantom row at the top:
require 'nokogiri'
require 'httparty'
require 'byebug'

def scraper
  url = "https://github.com/public-apis/public-apis"
  parsed_page = Nokogiri::HTML(HTTParty.get(url))
  # Get categories from the ul at the top
  categories = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/ul/li/a')
  # Get all tables from the page
  tables = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/table')
  rows = []
  # Acting on one first for testing before making it dynamic
  tables[0].search('tr').each do |tr|
    cells = tr.search('td')
    link = ''
    values = []
    row = {
      'name' => '',
      'description' => '',
      'auth' => '',
      'https' => '',
      'cors' => '',
      'category' => '',
      'url' => ''
    }
    cells.css('a').each do |a|
      link += a['href']
    end
    cells.each do |cell|
      values << cell.text
    end
    values << categories[0].text
    values << link
    rows << row.keys.zip(values).to_h
  end
  puts rows
end

scraper
Where is that first row coming from?

The first row you're seeing is most likely the header row. Header rows use <th> instead of <td>. That means cells = tr.search('td') will be an empty collection for the header row.

In most cases the header row is placed inside <thead> and the data rows inside <tbody>. So instead of doing tables[0].search('tr'), do tables[0].search('tbody tr'), which only selects the rows inside the <tbody> tag.
Your code can also be simpler and more resilient. Consider this:
require 'nokogiri'
require 'httparty'

URL = 'https://github.com/public-apis/public-apis'
FIELDS = %w[name description auth https cors category url]

doc = Nokogiri::HTML(HTTParty.get(URL))

category = doc.at('article li a').text
rows = doc.at('article table').search('tr')[1..-1].map { |tr|
  values = tr.search('td').map(&:text)
  link = tr.at('a')['href']
  Hash[
    FIELDS.zip(values + [category, link])
  ]
}
Which results in:
puts rows
# >> {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}
# >> {"name"=>"Cats", "description"=>"Pictures of cats from Tumblr", "auth"=>"apiKey", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://docs.thecatapi.com/"}
# >> {"name"=>"Dogs", "description"=>"Based on the Stanford Dogs Dataset", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://dog.ceo/dog-api/"}
# >> {"name"=>"HTTPCat", "description"=>"Cat for every HTTP Status", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://http.cat/"}
# >> {"name"=>"IUCN", "description"=>"IUCN Red List of Threatened Species", "auth"=>"apiKey", "https"=>"No", "cors"=>"Unknown", "category"=>"Animals", "url"=>"http://apiv3.iucnredlist.org/api/v3/docs"}
# >> {"name"=>"Movebank", "description"=>"Movement and Migration data of animals", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://github.com/movebank/movebank-api-doc"}
# >> {"name"=>"Petfinder", "description"=>"Adoption", "auth"=>"OAuth", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://www.petfinder.com/developers/v2/docs/"}
# >> {"name"=>"PlaceGOAT", "description"=>"Placeholder goat images", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://placegoat.com/"}
# >> {"name"=>"RandomCat", "description"=>"Random pictures of cats", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://aws.random.cat/meow"}
# >> {"name"=>"RandomDog", "description"=>"Random pictures of dogs", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://random.dog/woof.json"}
# >> {"name"=>"RandomFox", "description"=>"Random pictures of foxes", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://randomfox.ca/floof/"}
# >> {"name"=>"RescueGroups", "description"=>"Adoption", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://userguide.rescuegroups.org/display/APIDG/API+Developers+Guide+Home"}
# >> {"name"=>"Shibe.Online", "description"=>"Random pictures of Shibu Inu, cats or birds", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"http://shibe.online/"}
The problems with your code:

- Mixing at and search without knowing the difference. at('foo') returns the first matching node and is equivalent to search('foo').first; I use the former because it's cleaner, with less visual noise. There are other, more subtle differences between what at and search return, covered in the documentation. I strongly recommend reading it and experimenting with the examples, because knowing when to use which will save you headaches.
- Relying on absolute XPath selectors: absolute selectors are extremely brittle. Any change to the HTML is very likely to break them. Instead, find useful nodes to anchor on, check that they're unique, and let the parser find them:
  - 'article li a' uses a CSS selector to skip everything until the 'article' node is found, then looks inside it for a 'li' and an 'a' below that. You can do the same with XPath, but it's visually noisy, and I greatly prefer keeping my code as easy to read and understand as possible.
  - Similarly, at('article table') finds the first table under the 'article' node.
  - search('tr') then only finds the rows embedded in that table.
  - [1..-1] slices the NodeSet and skips the first row, because you want to skip the table header.
- Making it easier to build the structure:
  - rows = doc.at('article table').search('tr')[1..-1].map { |tr| loops over the rows once, building a hash for each row.
  - values is assigned the text of each 'td' node in the NodeSet.
  - Hash[...] makes it easy to build a hash by passing the constructor an array of key/value pairs.
  - FIELDS.zip(values + [category, link]) takes the values from the cells and appends a second array containing the category and the link for the row.
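The hash-building step can be tried in isolation with plain Ruby; the cell values below are made up for illustration:

```ruby
# Keys for each column, in table order, plus the two appended fields:
FIELDS = %w[name description auth https cors category url]

values   = ['Cat Facts', 'Daily cat facts', 'No', 'Yes', 'No'] # from the td cells
category = 'Animals'
link     = 'https://alexwohlbruck.github.io/cat-facts/'

# zip pairs each key with its value; Hash[] turns the pairs into a hash.
row = Hash[FIELDS.zip(values + [category, link])]
puts row['name']     # Cat Facts
puts row['category'] # Animals

# On Ruby 2.1+ the same thing reads as:
row2 = FIELDS.zip(values + [category, link]).to_h
```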
My example code is basically the template I follow every time I scrape a page of tables. There will be little differences here and there, but it's a loop over the tables, extracting the cells and converting them into a hash. On a cleanly written table it could even grab the keys for the hash automatically from the cell text of the table's first row.
Welp, I was sure I had targeted tbody at some point. This is embarrassing. Thank you very much!
thead and tbody, while they should exist, aren't in most documents on the web, either because the documents were created before those tags were added or because their creators didn't care. Browsers add them automatically and show them when we inspect the page source.