Python: scraping hidden leaderboard data from a site

python web-scraping

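The site has two active leaderboards (weekly and all-time). Both are loaded in the page source, but the weekly one is displayed by default. I'm trying to scrape the all-time data, but my selector apparently isn't specific enough, because I only get data from the weekly leaderboard.

Here is the script: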

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path="/Users/Rob/Documents/Python/chromedriver")

driver.get('https://community.koodomobile.com')
time.sleep(5)

# Store the rendered page source in `content`.
content = driver.page_source
# Parse the page source with BeautifulSoup, which represents the HTML as a
# nested data structure whose elements can be picked out with selectors.
# The parser is named explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(content, 'html.parser')

# Xpath to alltime : /html/body/div[3]/div[12]/div/div/div/div[2]/div[1]/div/div/section/div/div/div/div/div[2]
# Xpath to weekly : /html/body/div[3]/div[12]/div/div/div/div[2]/div[1]/div/div/section/div/div/div/div/div[1]

div = soup.find('div', attrs={'class': 'qa-tab-content'})
table = div.find('table', attrs={'class': 'leaderboard-table'})
products = []
for link in table.find_all('a', attrs={'link--user'}):
    products.append(link.text)
for x in products:
    print(x)
driver.quit()

I was thinking I could maybe use XPath, but BeautifulSoup doesn't support it. So, yeah, I don't know what the next step is.

The data is loaded from an external URL. You can use the requests and json modules to load it:

import json
import requests


url_week = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=thisWeek&maxResults=20&excludeRoles='
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='

# print for week:

data = requests.get(url_week).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for item in data:
    print(item['name'], item['points'])

print('-' * 80)

# print for all time:

data = requests.get(url_all_time).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for item in data:
    print(item['name'], item['points'])

Prints:

nim4165 1274
Robert T 1216
Dennis 761
Mayumi 643
Goran 524
Dinh 329
Sophia 279
Philosoraptor 120
Lucas8320 104
Bernard Koodo 100
notYetACustomer 85
Allan M 83
BobTheElectrician 72
Timo Tuokkola 71
Chrisowhy 71
Bernpow 71
LR 70
Wintyer 66
Mpeeg 65
Bailzellia 61
--------------------------------------------------------------------------------
Dennis 52349
Dinh 40790
Sophia 25967
Mayumi 25178
Goran 24552
Robert T 19718
Allan M 19649
Bernard Koodo 14323
nim4165 13305
Timo Tuokkola 11206
rikkster 7338
David AKU 5688
Ranjan Koodo 4506
BobTheElectrician 4124
Helen Koodo 3370
Mihaela Koodo 2764
Fred C 2537
Philosoraptor 2102
Paul Deschamps 1973
Emilia Koodo 1755
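
The two nearly identical requests above could be folded into one helper. The following is a minimal sketch, not part of the original answer: the names `fetch_leaderboard`, `max_results` and `get`, and the 10-second timeout, are assumptions made for illustration. It assumes the endpoint keeps returning a JSON list of objects with `name` and `points` keys, as the loops above suggest.

```python
BASE_URL = "https://community.koodomobile.com/widget/pointsLeaderboard"


def fetch_leaderboard(period, max_results=20, get=None):
    """Return (name, points) pairs for `period` ('thisWeek' or 'allTime').

    `get` is a hypothetical hook: any callable with the same interface as
    requests.get. It defaults to requests.get and exists so the function
    can be exercised with a stub instead of a live request.
    """
    if get is None:
        import requests  # imported lazily so a stub can be used without it
        get = requests.get

    # Same query string as the URLs in the answer, built from a dict.
    params = {"period": period, "maxResults": max_results, "excludeRoles": ""}
    response = get(BASE_URL, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors

    # Assumed response shape: a list of {"name": ..., "points": ...} objects.
    return [(item["name"], item["points"]) for item in response.json()]
```

Calling `fetch_leaderboard("allTime")` should then return the same pairs the second loop prints, provided the endpoint still behaves as it did when the answer was written.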

There are two tables with class="leaderboard-table". Call find_all on the soup object instead of find, and take the result at index 1.

Once I've located both tables in the page source, how do I specify which data I want afterwards? The find() function seems to give me "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" If I change it to find_all(), I get the same error.

@NoobAtPython I opened Firefox developer tools -> Network tab, and there were two requests containing the needed data.

What made you look at the Network tab? I did see the points.leaderbord xhr file there, but I would never have thought to look there. Is there something in the source code that indicates the data is loaded externally?

@NoobAtPython I did print(soup) and didn't see the data there, but the page has to get the data from "somewhere". Usually the page loads it via JavaScript/Ajax.

If it were in the HTML code, it should also show up in print(soup), shouldn't it? I mean: gosh, so many ways to do one simple thing. Definitely a lot more to learn. Thanks
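
The find_all suggestion and the ResultSet error from the comments can be illustrated with a small sketch. The HTML below is hypothetical, simplified markup standing in for the rendered page source (the real page is not reproduced here), and `leaderboard_names` is a name invented for this example. It assumes that the weekly table comes first and the all-time table second.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables share the same class, weekly first.
SAMPLE_HTML = """
<div class="qa-tab-content">
  <table class="leaderboard-table">
    <tr><td><a class="link--user">weekly_user_a</a></td></tr>
    <tr><td><a class="link--user">weekly_user_b</a></td></tr>
  </table>
</div>
<div class="qa-tab-content">
  <table class="leaderboard-table">
    <tr><td><a class="link--user">alltime_user_a</a></td></tr>
    <tr><td><a class="link--user">alltime_user_b</a></td></tr>
  </table>
</div>
"""


def leaderboard_names(html, index):
    """Return the user names from the leaderboard table at position `index`."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a ResultSet, i.e. a list of matching tags.
    tables = soup.find_all("table", class_="leaderboard-table")
    # Index into the ResultSet first; calling find_all() on the ResultSet
    # itself is what raises the AttributeError quoted in the comments.
    table = tables[index]
    links = table.find_all("a", class_="link--user")
    return [link.get_text(strip=True) for link in links]


weekly = leaderboard_names(SAMPLE_HTML, 0)
all_time = leaderboard_names(SAMPLE_HTML, 1)
print(all_time)
```

This only helps when both tables are actually present in the HTML handed to BeautifulSoup. If the rows are filled in later by JavaScript, requesting the underlying URL directly, as in the answer, is the more reliable route.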