BeautifulSoup not returning Twitch.tv viewer count

javascript python web-scraping

I am trying to scrape viewer counts from www.twitch.tv/directory using Python. I have tried a basic BeautifulSoup script:

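from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.twitch.tv/directory'
html = urlopen(url)
# Parse the fetched response, not the URL string; also tried html.parser, lxml
soup = BeautifulSoup(html, "html5lib")
soup.prettify()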

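This gives me HTML that does not contain the actual viewer counts.

I then tried requesting the AJAX data with params, but I got this error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

From there I moved on to Selenium: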

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Edge()
url = 'https://www.twitch.tv/directory'
driver.get(url)
# Also tried driver.execute_script("return document.documentElement.outerHTML") and innerHTML
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, "lxml")

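This gives the same result as the plain BeautifulSoup call.

Any help with scraping the viewer counts would be appreciated.

When the page first loads, the stats are not in the page. The page makes a GraphQL request to fetch the game data. When the user is not logged in, the GraphQL request asks for the query

AnonFrontPage_TopChannels

Here is a working request in Python: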

import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "AnonFrontPage_TopChannels",
            "variables": {"platformType": "all", "isTagsExperiment": True},
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "d94b2fd8ad1d2c2ea82c187d65ebf3810144b4436fbf2a1dc3af0983d9bd69e9",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)

print(json.loads(resp.content))

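I have included a Client-Id in the request. The ID does not seem to be unique to a session, but I imagine Twitch expires them, so this probably will not work forever. In the future you will have to inspect the GraphQL requests again and grab a new Client-Id, or figure out how to scrape one from the page programmatically.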
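Since the Client-Id may stop working at some point, it can help to fail loudly instead of hitting a bare KeyError or JSONDecodeError later. The following is only a sketch: `GqlResponseError` and `parse_gql_response` are hypothetical names, and how Twitch actually responds to a rejected Client-Id is an assumption here (a non-200 status, a non-JSON body, or the conventional GraphQL "errors" key).

```python
import json


class GqlResponseError(Exception):
    """Raised when a GraphQL response cannot be used (hypothetical helper)."""


def parse_gql_response(status_code, body_text):
    """Return the "data" part of a GraphQL response, or raise GqlResponseError."""
    # Assumption: a rejected or expired Client-Id shows up as a non-200 status.
    if status_code != 200:
        raise GqlResponseError(f"HTTP {status_code}: the Client-Id may have been rejected")
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError as exc:
        # An HTML or empty body is what produces "Expecting value: line 1 column 1".
        raise GqlResponseError("response body is not JSON") from exc
    # GraphQL servers conventionally report failures under an "errors" key.
    if not isinstance(payload, dict) or payload.get("errors") or "data" not in payload:
        raise GqlResponseError(f"unexpected GraphQL response: {payload!r}")
    return payload["data"]
```

With the request above it would be called as `data = parse_gql_response(resp.status_code, resp.text)`, after which `data["streams"]["edges"]` can be read as usual.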

This request actually appears to be for the top live channels section. Here is how to get the viewer counts and titles:

edges = json.loads(resp.content)["data"]["streams"]["edges"]
games = [(f["node"]["title"], f["node"]["viewersCount"]) for f in edges]

# games:
[
    ("Let us GAME", 78250),
    ("(REBROADCAST) Worlds Play-In Knockouts: Cloud9 vs. Gambit Esports", 36783),
    ("RuneFest 2018 - OSRS Reveals !schedule", 35042),
    (None, 25237),
    ("Front Page of TWITCH + Fortnite FALL SKIRMISH Training!", 22380),
    ("Reckful - 3v3 with barry and a german", 20399),
]
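Note that a title can be None, as in the fourth entry above. A small sketch of one way to handle that and sort by viewers follows; `top_streams` and the "(untitled)" placeholder are made up for illustration, and the sample dictionary only mimics the response shape shown in this answer.

```python
def top_streams(payload, untitled="(untitled)"):
    """Return (title, viewers) pairs sorted by viewer count, highest first."""
    edges = payload["data"]["streams"]["edges"]
    rows = [
        # Some streams have no title; substitute a placeholder for None.
        (edge["node"]["title"] or untitled, edge["node"]["viewersCount"])
        for edge in edges
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hand-built sample in the same shape as the parsed response above.
sample = {
    "data": {
        "streams": {
            "edges": [
                {"node": {"title": None, "viewersCount": 25237}},
                {"node": {"title": "Let us GAME", "viewersCount": 78250}},
            ]
        }
    }
}

for title, viewers in top_streams(sample):
    print(f"{viewers:>7,}  {title}")
```

For a live response, the same function would take `json.loads(resp.content)` as its argument.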

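You will need to check the Chrome network inspector and work out the structure of the other requests to get more data.

Here is an example for the directory page: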

import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "BrowsePage_AllDirectories",
            "variables": {
                "limit": 30,
                "directoryFilters": ["GAMES"],
                "isTagsExperiment": True,
                "tags": [],
            },
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)

edges = json.loads(resp.content)["data"]["directoriesWithTags"]["edges"]
games = [f["node"] for f in edges]
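Both requests above share the same persisted-query shape and differ only in the operation name, the variables, and the hash. A minimal sketch of a helper that builds the request body is shown below; `persisted_query_body` is a hypothetical name, and the operation name, variables, and hash are the ones copied from the network inspector above, so they may no longer be valid.

```python
import json


def persisted_query_body(operation_name, variables, sha256_hash):
    """Build the JSON body for a persisted GraphQL query (illustrative helper)."""
    return json.dumps(
        {
            "operationName": operation_name,
            "variables": variables,
            "extensions": {
                "persistedQuery": {"version": 1, "sha256Hash": sha256_hash}
            },
        }
    )


# Values taken from the directory-page request shown above.
directory_body = persisted_query_body(
    "BrowsePage_AllDirectories",
    {"limit": 30, "directoryFilters": ["GAMES"], "isTagsExperiment": True, "tags": []},
    "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
)
```

The resulting string can be passed as the second argument to `requests.post`, in place of the inline `json.dumps(...)` call, together with the same Client-Id header.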

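I just looked at the page and viewed the source. It seems all the data is fetched via JavaScript, and there is no "normal" HTML there. So it is not possible to scrape this data from the HTML with something like BeautifulSoup: it parses HTML and cannot run JavaScript.

@RobinZigmond Hi Robin. Is there any other way I could get this data for my research? Thanks.

I really don't know, I'm afraid; I don't use Twitch. Twitch does seem to have an API, as I would expect. I think you can get the information you need from it, but I can't help you with how to use it.

@RobinZigmond I'll look into it. That might be the easiest way. Thanks for the quick reply!

How would you modify the operationName to get the "twitch.tv/directory" view of top games? Edit: I see, I have to go to the network inspector. Thanks! This is exactly what I was looking for.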
