Arrays: how do I remove a phantom row from my array?
I'm scraping a bunch of tables with HTTParty and then parsing the responses with Nokogiri. Everything works fine, but I'm seeing a phantom row at the top:
require 'nokogiri'
require 'httparty'
require 'byebug'

def scraper
  url = "https://github.com/public-apis/public-apis"
  parsed_page = Nokogiri::HTML(HTTParty.get(url))
  # Get categories from the ul at the top
  categories = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/ul/li/a')
  # Get all tables from the page
  tables = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/table')
  rows = []
  # Acting on one first for testing before making it dynamic
  tables[0].search('tr').each do |tr|
    cells = tr.search('td')
    link = ''
    values = []
    row = {
      'name' => '',
      'description' => '',
      'auth' => '',
      'https' => '',
      'cors' => '',
      'category' => '',
      'url' => ''
    }
    cells.css('a').each do |a|
      link += a['href']
    end
    cells.each do |cell|
      values << cell.text
    end
    values << categories[0].text
    values << link
    rows << row.keys.zip(values).to_h
  end
  puts rows
end

scraper
Where is that first row coming from?

The first row you're seeing is most likely the header row. Header rows use <th> instead of <td>. That means cells = tr.search('td') will be an empty collection for the header row.

In most cases the header row is placed inside <thead> and the data rows inside <tbody>. So instead of doing tables[0].search('tr'), do tables[0].search('tbody tr'), which only selects the rows inside the <tbody> tag.
Your code can also be simpler and more resilient. Consider this:
require 'nokogiri'
require 'httparty'

URL = 'https://github.com/public-apis/public-apis'
FIELDS = %w[name description auth https cors category url]

doc = Nokogiri::HTML(HTTParty.get(URL))

category = doc.at('article li a').text
rows = doc.at('article table').search('tr')[1..-1].map { |tr|
  values = tr.search('td').map(&:text)
  link = tr.at('a')['href']
  Hash[
    FIELDS.zip(values + [category, link])
  ]
}
Which results in:
puts rows
# >> {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}
# >> {"name"=>"Cats", "description"=>"Pictures of cats from Tumblr", "auth"=>"apiKey", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://docs.thecatapi.com/"}
# >> {"name"=>"Dogs", "description"=>"Based on the Stanford Dogs Dataset", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://dog.ceo/dog-api/"}
# >> {"name"=>"HTTPCat", "description"=>"Cat for every HTTP Status", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://http.cat/"}
# >> {"name"=>"IUCN", "description"=>"IUCN Red List of Threatened Species", "auth"=>"apiKey", "https"=>"No", "cors"=>"Unknown", "category"=>"Animals", "url"=>"http://apiv3.iucnredlist.org/api/v3/docs"}
# >> {"name"=>"Movebank", "description"=>"Movement and Migration data of animals", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://github.com/movebank/movebank-api-doc"}
# >> {"name"=>"Petfinder", "description"=>"Adoption", "auth"=>"OAuth", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://www.petfinder.com/developers/v2/docs/"}
# >> {"name"=>"PlaceGOAT", "description"=>"Placeholder goat images", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://placegoat.com/"}
# >> {"name"=>"RandomCat", "description"=>"Random pictures of cats", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://aws.random.cat/meow"}
# >> {"name"=>"RandomDog", "description"=>"Random pictures of dogs", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://random.dog/woof.json"}
# >> {"name"=>"RandomFox", "description"=>"Random pictures of foxes", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://randomfox.ca/floof/"}
# >> {"name"=>"RescueGroups", "description"=>"Adoption", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://userguide.rescuegroups.org/display/APIDG/API+Developers+Guide+Home"}
# >> {"name"=>"Shibe.Online", "description"=>"Random pictures of Shibu Inu, cats or birds", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"http://shibe.online/"}
The problems with your code:

- Mixing at and search without knowing the difference. at('foo') returns the first matching node and is equivalent to search('foo').first; I use the former because it's cleaner, with less visual noise. There are other, more subtle differences between what at and search return, covered in the documentation. I strongly recommend reading it and experimenting with the examples, because knowing when to use which will save you headaches.
- Relying on absolute XPath selectors: absolute selectors are extremely brittle. Any change to the HTML is very likely to break them. Instead, find useful nodes to anchor on, check that they're unique, and let the parser find them:
  - 'article li a' uses a CSS selector to skip everything until the 'article' node is found, then looks inside it for a 'li' and an 'a' below that. You can do the same with XPath, but it's visually noisy, and I greatly prefer keeping my code as easy to read and understand as possible.
  - Similarly, at('article table') finds the first table under the 'article' node.
  - search('tr') then only finds the rows embedded in that table.
  - [1..-1] slices the NodeSet and skips the first row, because you want to skip the table header.
- Making it easier to build the structure:
  - rows = doc.at('article table').search('tr')[1..-1].map { |tr| loops over the rows once, building a hash for each row.
  - values is assigned the text of each 'td' node in the NodeSet.
  - Hash[...] makes it easy to build a hash by passing the constructor an array of key/value pairs.
  - FIELDS.zip(values + [category, link]) takes the values from the cells and appends a second array containing the category and the link for the row.
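The hash-building step can be tried in isolation with plain Ruby; the cell values below are made up for illustration:

```ruby
# Keys for each column, in table order, plus the two appended fields:
FIELDS = %w[name description auth https cors category url]

values   = ['Cat Facts', 'Daily cat facts', 'No', 'Yes', 'No'] # from the td cells
category = 'Animals'
link     = 'https://alexwohlbruck.github.io/cat-facts/'

# zip pairs each key with its value; Hash[] turns the pairs into a hash.
row = Hash[FIELDS.zip(values + [category, link])]
puts row['name']     # Cat Facts
puts row['category'] # Animals

# On Ruby 2.1+ the same thing reads as:
row2 = FIELDS.zip(values + [category, link]).to_h
```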
My example code is basically the template I follow every time I scrape a page of tables. There will be little differences here and there, but it's a loop over the tables, extracting the cells and converting them into a hash. On a cleanly written table it could even grab the keys for the hash automatically from the cell text of the table's first row.
Welp, I was sure I had targeted tbody at some point. This is embarrassing. Thank you very much!
thead and tbody, while they should exist, aren't in most documents on the web, either because the documents were created before those tags were added or because their creators didn't care. Browsers add them automatically and show them when we inspect the page source.