How to step through pages when web scraping with Python

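I wrote the code below to scrape the CarGurus website. The search shows 15 listings per page.

I want to move iteratively from page 1 to page n and scrape each page. The code below should do that, but at the end of the script I have a DataFrame df that contains the first page duplicated numPages times.

I thought the code wasn't giving the computer time to receive the response, so I added a time.sleep(1) line, but that doesn't seem to help.

What am I doing wrong?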

# Import Modules
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import seaborn as sns
import time

#Utility Functions
def to_number(s):
    #Convert to  Number
    numval = int(s.replace(',',''))
    return numval

def get_location(s):
    #Convert to  City, State (SS), and zip (string)
    s = s.replace(',','')
    sList = s.split()
    n = len(sList)-1
    City = ''
    for word in sList[0:n-1]:
        City += word  + ' '
    City = City[:-1]
    State = sList[n-1]
    Zip = sList[n]
    return City, State, Zip

def get_YearMakeModelTrim(s):
    #Convert to  Year, Make, Model, Trim
    sList = s.split()
    n = len(sList)-1
    Year = sList[0]
    Make = sList[1]
    Model = sList[2]
    if n == 3:
        Trim = sList[3]
    else:
        Trim = "None"
    return Year, Make, Model, Trim

numPages = 10

baseURL = 'https://www.cargurus.com/Cars/inventorylisting/viewDetailsFilterViewInventoryListing.action?sourceContext=forSaleTab_false_0&newSearchFromOverviewPage=true&inventorySearchWidgetType=AUTO&entitySelectingHelper.selectedEntity=c24578&entitySelectingHelper.selectedEntity2=c25202&zip=03062&distance=50000&searchChanged=true&modelChanged=false&filtersModified=true#resultsPage={}'


data = []
for ii in range(numPages):
    URL = baseURL.format(ii+1)
    print(URL)

    r  = requests.get(URL).text
    time.sleep(1)
    soup = bs(r,'html.parser')

    stats = soup.find_all("div", attrs = {"class": "cg-dealFinder-result-stats"})
    deals = soup.find_all("div", attrs = {"class": "cg-dealFinder-result-deal"})
    titles = soup.find_all("h4", {"class":"cg-dealFinder-result-model"})

    for title, deal, stat in zip(titles,deals,stats):
        row = {}
        row["Price"] = to_number(stat.find('span').get_text()[1:])
        row["Mileage"] = to_number(stat.find_all("p")[1].text[9:])
        row["City"],  row["State"], row["Zip"] = get_location(stat.find_all("p")[2].text[10:])
        row["natAvgPrice"] = to_number(deal.find('span', attrs = {'class': 'nationalAvg'}).get_text()[17:])
        row["Year"], row["Make"],  row["Model"], row["Trim"] = get_YearMakeModelTrim(title.find('span', attrs = {'itemprop': 'name'}).get_text())
        row["NewUsed"] = title.find('span', attrs = {'class': 'invisibleLayer'}).get_text()[:-5]
        data.append(row)

df = pd.DataFrame(data)
#df = df.drop_duplicates()

# Recent seaborn versions call this parameter `height`; older ones used `size`
sns.pairplot(x_vars=["Mileage"], y_vars=["Price"], data=df, hue="Trim", height=5)

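This page uses JavaScript/AJAX to read its data from the URL

https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=forSaleTab_false_0

It uses a POST request with parameters, and one of those parameters is page. The #resultsPage={} part of the URL in the question is a fragment, which the client keeps to itself and never sends to the server, so requests.get() receives the same first page on every iteration.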
from bs4 import BeautifulSoup
import requests

params = {
    'zip': '03062',
    'address': 'Nashua,+NH',
    'latitude': "42.73040008544922",
    'longitude': '-71.49479675292969',
    'distance': 50000,
    'selectedEntity': 'c24578',
    'entitySelectingHelper.selectedEntity2': 'c25202',
    'minPrice': '',
    'maxPrice': '', 
    'minMileage': '',   
    'maxMileage': '',   
    'transmission': 'ANY',
    'bodyTypeGroup': '',    
    'serviceProvider': '',  
    'page': 1,
    'filterBySourcesString': '',
    'filterFeaturedBySourcesString': '',
    'displayFeaturedListings': True,
    'searchSeoPageType': '',    
    'inventorySearchWidgetType': 'AUTO',
    'allYearsForTrimName': False,
    'daysOnMarketMin': '',  
    'daysOnMarketMax': '',
    'vehicleDamageCategoriesRaw': '',
    'minCo2Emission': '',
    'maxCo2Emission': '',
    'vatOnly': False,
    'minEngineDisplacement': '',
    'maxEngineDisplacement': '',
    'minMpg': '',
    'maxMpg': '',   
    'startYear': 2015,
    'endYear': 2016,
    'isRecentSearchView': False,
}

url = 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=forSaleTab_false_0'

display_keys = True

for x in range(1, 4):

    params['page'] = x

    response = requests.post(url, data=params)

    data = response.json()

    if display_keys:
        display_keys = False
        for key in data.keys():
            print('key:', key)
        for key in data['listings'][0].keys():
            print("data['listings'] key:", key)
        print('-----')

    print('--- offers number:', len( data['listings']), '---')
    for item in data['listings'][:10]:
        print(item['id'], data['modelName'], item['modelName'], item['trimName'])

Result keys:

key: listings
key: modelName
key: styleSet
key: modelId
key: serviceProviders
key: page
key: sellers
key: remainingResults
data['listings'] key: bodyType
data['listings'] key: fleet
data['listings'] key: serviceProviderId
data['listings'] key: saved
data['listings'] key: highwayFuelEconomy
data['listings'] key: modelId
data['listings'] key: nonwholesaleSellerId
data['listings'] key: isFranchiseDealer
data['listings'] key: regressionPrice
data['listings'] key: rating
data['listings'] key: listedDate
data['listings'] key: dealerRatingPriceAdjustment
data['listings'] key: isOEMCPO
data['listings'] key: sellerId
data['listings'] key: transmission
data['listings'] key: mainPictureUrl
data['listings'] key: monthlyPayment
data['listings'] key: price
data['listings'] key: exteriorColorName
data['listings'] key: id
data['listings'] key: isFeatured
data['listings'] key: mileage
data['listings'] key: makeId
data['listings'] key: zip
data['listings'] key: noPhotos
data['listings'] key: isCertified
data['listings'] key: msrpString
data['listings'] key: engineCylinders
data['listings'] key: expectedPriceString
data['listings'] key: trimName
data['listings'] key: daysOnMarket
data['listings'] key: scaleMainPictureOnLoad
data['listings'] key: vehicleDamageCategory
data['listings'] key: monthlyPaymentString
data['listings'] key: isOutlier
data['listings'] key: cityFuelEconomy
data['listings'] key: savingsAmount
data['listings'] key: ownerCount
data['listings'] key: absoluteRating
data['listings'] key: salvage
data['listings'] key: contacted
data['listings'] key: priceString
data['listings'] key: distance
data['listings'] key: originalPrice
data['listings'] key: sellerRating
data['listings'] key: mileageString
data['listings'] key: engineType
data['listings'] key: wheelSystemDisplay
data['listings'] key: isDisplayConquestSection
data['listings'] key: serviceProviderName
data['listings'] key: carYear
data['listings'] key: savingsRecommendation
data['listings'] key: specificOptionIds
data['listings'] key: lemon
data['listings'] key: vehicleIdentifier
data['listings'] key: bodyTypeGroupId
data['listings'] key: useAnonymousContactEmail
data['listings'] key: msrp
data['listings'] key: sellerCity
data['listings'] key: bodyTypeGroupName
data['listings'] key: savingsArrowImage
data['listings'] key: dealScore
data['listings'] key: frameDamaged
data['listings'] key: hasAccidents
data['listings'] key: isCPO
data['listings'] key: expectedPrice
data['listings'] key: engineDisplacement
data['listings'] key: priceDifferentialString
data['listings'] key: trimLevelName
data['listings'] key: isNew
data['listings'] key: modelName
data['listings'] key: bodyTypeId
data['listings'] key: theftTitle
data['listings'] key: fuelType
data['listings'] key: maxSeating
data['listings'] key: wheelSystem
data['listings'] key: isConquestEnabled
data['listings'] key: autoEntityId
data['listings'] key: franchiseMake
data['listings'] key: optionIds
data['listings'] key: makeName
-----
Results: for each request (each with a different page value) I display only the first 10 items.
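
To get back to the DataFrame the question was building, the per-page loop above can be wrapped in a function that collects one row per listing. What follows is only a sketch under stated assumptions: `listing_to_row` and `scrape_pages` are hypothetical helper names, the selected keys are taken from the key list printed above, their value types are not shown in the source, and treating an empty `listings` list as "past the last page" is a guess about the endpoint's behaviour. The block runs offline against a fake `post` function with made-up values, so nothing here is real CarGurus data.

```python
def listing_to_row(item):
    """Pick a few fields from one entry of data['listings']."""
    return {
        "Price": item.get("price"),
        "Mileage": item.get("mileage"),
        "Year": item.get("carYear"),
        "Make": item.get("makeName"),
        "Model": item.get("modelName"),
        "Trim": item.get("trimName"),
        "City": item.get("sellerCity"),
        "Zip": item.get("zip"),
        "IsNew": item.get("isNew"),
    }


def scrape_pages(post, url, params, num_pages):
    """POST once per page and collect one row per listing.

    `post` is any callable that accepts (url, data=...) like requests.post
    and returns an object with a .json() method.
    """
    rows = []
    for page in range(1, num_pages + 1):
        page_params = dict(params, page=page)  # copy; the caller's dict is untouched
        data = post(url, data=page_params).json()
        listings = data.get("listings", [])
        if not listings:
            # Assumption: an empty list means we ran past the last page.
            break
        rows.extend(listing_to_row(item) for item in listings)
    return rows


# --- Offline demo with a fake endpoint (all values below are made up) ---
class FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


FAKE_PAGES = {
    1: [
        {"price": 20000, "mileage": 15000, "carYear": 2015, "trimName": "Base"},
        {"price": 22000, "mileage": 9000, "carYear": 2016, "trimName": "Premium"},
    ],
    2: [
        {"price": 18500, "mileage": 30000, "carYear": 2015, "trimName": "Base"},
    ],
}

requested_pages = []


def fake_post(url, data):
    # Record which page was asked for, then answer like the JSON endpoint might.
    requested_pages.append(data["page"])
    return FakeResponse({"listings": FAKE_PAGES.get(data["page"], [])})


demo_rows = scrape_pages(
    fake_post, "https://example.invalid/listings", {"zip": "03062"}, num_pages=10
)
print(len(demo_rows), "rows from pages", requested_pages)
```

With network access, the same function could be called as scrape_pages(requests.post, url, params, numPages) using the url and params from the answer, and the returned list passed to pd.DataFrame() just as the question does with data.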


Use print() to display the URL and the data from the page. Maybe you always read the same page.

I think I am reading the same page. That is the point of this question: why doesn't requests fetch the HTML of the next page? I did print the URL; it is in the code. The URL changes with each iteration of the loop. Here is the (shortened) output for the first 3 iterations:
https://www.cargurus.com/Cars/i... =true#resultsPage=1
https://www.cargurus.com/Cars/i... =true#resultsPage=2
https://www.cargurus.com/Cars/i... =true#resultsPage=3

I think it uses JavaScript to replace the data. Even if you use a different URL you get the same data, because requests + BeautifulSoup can't run JavaScript. You may have to use Selenium to control a web browser, which will load the page and run the JavaScript.

In the DevTools in Chrome/Firefox I see that it uses the URL
https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=forSaleTab_false_0
to get some data. Maybe by requesting this URL (with different parameters) you can get everything you need as JSON, which you can easily convert to a Python dictionary with the json module.
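
A small standard-library check can show why the printed URLs differ while the downloaded data does not. Everything after # in a URL is a fragment, which an HTTP client keeps locally and does not send to the server. The sketch below uses a shortened stand-in for the question's baseURL (the query string is abbreviated for illustration; only the #resultsPage={} suffix matters) and makes no network request.

```python
from urllib.parse import urldefrag

# Shortened stand-in for the question's baseURL; the query string is
# abbreviated here, only the "#resultsPage={}" suffix is relevant.
base_url = (
    "https://www.cargurus.com/Cars/inventorylisting/"
    "viewDetailsFilterViewInventoryListing.action"
    "?zip=03062&distance=50000#resultsPage={}"
)

page_urls = [base_url.format(page) for page in range(1, 4)]

# urldefrag() separates the part a server would see from the fragment.
sent_to_server = [urldefrag(u).url for u in page_urls]
fragments = [urldefrag(u).fragment for u in page_urls]

for fragment, sent in zip(fragments, sent_to_server):
    print("fragment:", fragment, "| server sees:", sent)

print("distinct URLs as written:", len(set(page_urls)))
print("distinct URLs the server sees:", len(set(sent_to_server)))
```

All three URLs collapse to the same request once the fragment is removed, which matches the symptom in the question: the page number has to travel in the request itself, for example as the page parameter of the POST request.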