When using OpenURI in Ruby, some HTML content comes back as "Loading…"
I am trying to write a program that compares specific content on a web page at one point in time with the same content later, and I am currently stuck on getting the information that changes. If I inspect the element in the browser, the changing text is there, but if I fetch the page with OpenURI it is not; it shows up as "Loading…" (see image). Is there a way to get all of the HTML text? Here is my current code:
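require 'open-uri'
# Kernel#open no longer accepts URLs on Ruby 3.0+, so call URI.open instead
contents = URI.open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)
# write the fetched HTML to a file for inspection
File.open("testing.txt", "w") do |file|
  file.puts contents
end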
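If anyone could help me get the "Loading…" text replaced with the actual HTML, that would be amazing.
Thanks
Your web page makes Ajax requests. OpenURI only returns the page as the server sent it; it does not wait for the Ajax requests to finish.
You can use the code below to wait for the page to load: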
#load the libraries
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html
Note: you need chromedriver to use this script.
If you don't want the URL to open in a visible browser window, you can use a headless WebKit browser instead.
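A fixed `sleep 2` can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the placeholder is gone. The sketch below is only an illustration: `wait_until` is a hypothetical helper written here (it is not Watir's own API), and `fake_html` is a stand-in for a real page so that the example runs without a browser. With a real browser, the block would presumably inspect `browser.html` instead.

```ruby
# Hypothetical helper (not part of Watir): polls a block until it returns a
# truthy value, or raises once the timeout has elapsed.
def wait_until(timeout: 10, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for a page whose HTML changes after a short delay.
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
fake_html = lambda do
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  elapsed < 0.2 ? '<div>Loading...</div>' : '<div>48 listings</div>'
end

# Keep polling until the placeholder text has been replaced.
html = wait_until(timeout: 2, interval: 0.05) do
  current = fake_html.call
  current unless current.include?('Loading...')
end
puts html
```

The helper returns as soon as the condition holds, so the script waits only as long as it has to, and it fails loudly if the content never arrives.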
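So, OpenURI just makes an HTTP request and gives you access to the body. In this case the body is HTML. That HTML contains a placeholder for this data, which is what you are seeing. The HTML then tells the browser to load some JavaScript, which makes another request to the server to fetch the data, and when the data arrives it replaces the placeholder with the real data. So to handle this, you ultimately need whatever is returned by the request that the JavaScript makes.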
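To make the difference concrete, here is a minimal sketch that checks whether a fetched body still contains only the placeholder. Everything in it is an assumption made for illustration: `placeholder_only?` is a hypothetical helper, and both HTML snippets are made-up samples rather than the real page markup.

```ruby
# Hypothetical check: does the body still show the loading placeholder?
def placeholder_only?(html, placeholder: 'Loading')
  # Drop <script> blocks first so the word inside JavaScript does not count.
  visible = html.gsub(%r{<script\b.*?</script>}mi, '')
  visible.include?(placeholder)
end

# Made-up sample of what a plain HTTP request might return.
static_body = <<~HTML
  <html><body>
    <div id="listings">Loading...</div>
    <script src="/app.js"></script>
  </body></html>
HTML

# Made-up sample of the DOM after the JavaScript has filled in the data.
rendered_body = <<~HTML
  <html><body>
    <div id="listings"><p>2006 Mazda MAZDASPEED6</p></div>
  </body></html>
HTML

puts placeholder_only?(static_body)   # true
puts placeholder_only?(rendered_body) # false
```

A check like this can tell a script when the plain HTTP response is not enough and it needs either a real browser or the underlying data request.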
Three solutions
From my least favorite to my favorite
require 'uri'
require 'net/http'
# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)
# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data
# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)
# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"
# parse the response
require 'json'
json = JSON.parse res.body
# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false
# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"
# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"
# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => "$8,972"
listing['priceString'] # => "$6,890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"
# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0
# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]
Oh, good idea! Note that on OSX you can install it with Homebrew:
brew install --cask chromedriver
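Thank you!! So json['listings'].size gives the total number for sale, which means I can store it in a variable, e.g. totalNum = json['listings'].size? Thanks for your help!
Yes, but note that this is only correct b/c remainingResults is false. If that is your only goal, it is probably better to go with #2, which Rahul's solution implements. However, you should find an example of a car with multiple pages of results and then compare the two approaches, because I didn't see anything in the response that tells you how many results there are in total, how many are on this page, and whether it is the last page. Since that response is what the JS uses, either the JS is wrong, or it makes a lot of requests, or there is some information in the response that I am not seeing.
Also, when I say "multiple pages of results", I mean in the API response, not on the web page. The web page shows 15 results at a time, even though it already contains all of them. So pagination on the site is a UX choice; in this case the site already has the data for all the other pages, because the API response has only one page.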
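If a search ever does span several API pages, the total could be gathered with a loop like the sketch below. This rests on assumptions the thread does not confirm: that `remainingResults` stays true while more pages exist, and that the `page` form field selects the page. `collect_listings` and the stubbed `fake_pages` data are hypothetical and exist only so the example runs without hitting the live endpoint.

```ruby
# fetch_page is any callable that takes a page number and returns the parsed
# JSON hash for that page (for example, a wrapper around the POST request).
def collect_listings(fetch_page, max_pages: 50)
  listings = []
  (1..max_pages).each do |page|
    json = fetch_page.call(page)
    listings.concat(json.fetch('listings', []))
    # Assumption: remainingResults is true while more pages exist.
    break unless json['remainingResults']
  end
  listings
end

# Stubbed two-page response, used here instead of the live endpoint.
fake_pages = {
  1 => { 'page' => 1, 'listings' => [{ 'carYear' => 2006 }, { 'carYear' => 2007 }], 'remainingResults' => true },
  2 => { 'page' => 2, 'listings' => [{ 'carYear' => 2006 }], 'remainingResults' => false }
}

all_listings = collect_listings(->(page) { fake_pages.fetch(page) })
total_num = all_listings.size
puts total_num
```

The `max_pages` cap is a safety net in case the flag never turns false, and `total_num` follows Ruby's snake_case convention in place of `totalNum`.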