Scraping g2a[dot]com with BeautifulSoup in Python

Tags: python, web-scraping, beautifulsoup, urllib

I'm trying to scrape the list of best prices for the games I'm looking for from this game site (g2a[dot]com). The prices are usually in a table (see image).

My code to reach the table is:

import urllib.request
from bs4 import BeautifulSoup

for gTitle in gameList:
    page = urllib.request.urlopen('http://www.g2a.com/%s.html' % gTitle).read()
    soup = BeautifulSoup(page, 'lxml')
    table = soup.find('table', class_='mp-user-rating')
But when I print the table, all I get back is the table tag with no content at all:

>>> <table class="mp-user-rating jq-wh-offers wh-table"></table>

Is this a bug, or am I doing something wrong? I'm using Python 3.6.1 with BeautifulSoup 4 and urllib. I'd like to keep using these if possible, but I'm open to alternatives.
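One way to see why the table comes back empty is to count the rows in the HTML the server actually sends: the offers are injected by JavaScript in the browser, so the static markup contains only the table shell and there is nothing for BeautifulSoup to find. A minimal stdlib sketch (the served HTML string below is a stand-in modelled on the empty output above):

```python
from html.parser import HTMLParser

class RowCounter(HTMLParser):
    """Counts <tr> elements in a chunk of HTML."""
    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows += 1

# The HTML the server sends: only the table shell, no rows --
# the offer rows are filled in later by JavaScript in the browser.
served = '<table class="mp-user-rating jq-wh-offers wh-table"></table>'
counter = RowCounter()
counter.feed(served)
print(counter.rows)  # 0 -- nothing for BeautifulSoup to extract
```

If the raw response already lacked the rows, no HTML parser will recover them; you need something that executes the page's JavaScript, or the underlying API the page calls.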

As Pedro suggested, I've tried Selenium, and it did indeed do the job. Thank you, Pedro! For anyone interested, my code:

# importing packages
from selenium import webdriver

# game list
gameList = ['mass-effect-andromeda-origin-cd-key-preorder-global',
            'total-war-warhammer-steam-cd-key-preorder-global',
            'starcraft-2-heart-of-the-swarm-cd-key-global-1']

# scraping
chromePath = r"C:\Users\userName\Documents\Python\chromedriver.exe"
driver = webdriver.Chrome(chromePath)  # one browser instance is enough
for gTitle in gameList:
    driver.get('http://www.g2a.com/%s.html' % gTitle)
    table = driver.find_element_by_xpath("""//*[@id="about-game"]/div/div[3]/div[1]/table/tbody""")
    # the best offer sits on the third line of the table text;
    # slice out the price digits and normalise the decimal separator
    bestPrice = ''.join(list(table.text.split('\n'))[2][12:][:6])
    bestPrice = float(bestPrice.replace(",", "."))
    print(bestPrice)
driver.quit()
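The fixed-index slicing above (`[2][12:][:6]`) breaks as soon as the row label or price length changes. A more robust alternative is to search the row text for a price-shaped number; this is a sketch with a hypothetical helper, and the sample row texts are assumptions, not actual g2a output:

```python
import re

def extract_price(row_text):
    """Pull the first price-looking number out of a row of table text.
    Accepts either ',' or '.' as the decimal separator.
    Returns None when no price is present."""
    m = re.search(r'\d+[.,]\d{2}', row_text)
    return float(m.group(0).replace(',', '.')) if m else None

print(extract_price('Best offer EUR 12,34 Buy now'))  # 12.34
```

This keeps working whether the site renders `12,34` or `12.34`, and fails loudly (returns `None`) instead of silently parsing garbage when the layout shifts.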

I looked at the website. When you click "LOAD MORE", it loads the next batch of games from where the list currently ends. If you look at the Network tab in your browser's inspect-element tools and filter to "XHR" requests only, you can see the API endpoint used to load each new set of games. I've used this API endpoint as my URL:

import requests, json

pageNum = 0  # start at 0 (values below 0 also start from 0)
while True:
    url = "https://www.g2a.com/lucene/search/filter?&minPrice=0.00&maxPrice=10000&cn=&kr=&stock=all&event=&platform=0&search=&genre=0&cat=0&sortOrder=popularity+desc&start={}&rows=12&steam_app_id=&steam_category=&steam_prod_type=&includeOutOfStock=&includeFreeGames=false&_=1492758607443".format(str(pageNum))

    # `docs` contains each game as a dictionary, from which you can
    # take out the required information
    games_list = json.loads(requests.get(url).text)['docs']

    if len(games_list) == 0:
        break  # the maximum of the start parameter is reached and games_list is empty
    else:
        pageNum += 12  # the start parameter advances by 12 each time "LOAD MORE" is clicked
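The pagination logic above can be exercised without hitting the network by swapping the request for a stub. This is a sketch: the 30-game catalogue and the `fetch_page` helper are assumptions standing in for the real endpoint, which returns 12 games per page:

```python
def fetch_page(start):
    """Stub standing in for requests.get(...).json()['docs'];
    like the real endpoint, it returns up to 12 games per page."""
    all_games = [{'id': i} for i in range(30)]  # pretend the site has 30 games
    return all_games[start:start + 12]

games, start = [], 0
while True:
    batch = fetch_page(start)
    if not batch:
        break  # an empty page means we've walked past the end of the catalogue
    games.extend(batch)
    start += 12  # the "start" query parameter advances by 12 per "LOAD MORE"

print(len(games))  # 30
```

Keeping the loop logic separate from the HTTP call like this also makes it easy to add polite rate limiting (e.g. a `time.sleep` between pages) in one place.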

What you need is generated with JavaScript, so you can't get it with BeautifulSoup. Consider using Selenium.