Python bs4: Why do I only see part of the HTML?

python web-scraping

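I am using bs4 to scrape Product Hunt.

Taking one post as an example, when I scrape it with the code below, the "Discussion" section is completely missing.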

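import pprint

import bs4
import requests

# Fetch the static HTML only; no JavaScript is executed here.
res = requests.get('https://producthunt.com/posts/weights-biases')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
pprint.pprint(soup.prettify())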

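I suspect this has to do with lazy loading (when you open the page, the "Discussion" section takes an extra second or two to appear).

How do I scrape lazy-loaded components? Or is this something else entirely?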

This is how you can get the comments in the discussion. You can adapt the script to also get the replies that belong to each thread.

import json
import requests
from pprint import pprint

url = 'https://www.producthunt.com/frontend/graphql'
payload = {"operationName":"PostPageCommentsSection","variables":{"commentsListSubjectThreadsCursor":"","commentsThreadRepliesCursor":"","slug":"weights-biases","includeThreadForCommentId":None,"commentsListSubjectThreadsLimit":10},"query":"query PostPageCommentsSection($slug:String$commentsListSubjectThreadsCursor:String=\"\"$commentsListSubjectThreadsLimit:Int!$commentsThreadRepliesCursor:String=\"\"$commentsListSubjectFilter:ThreadFilter$includeThreadForCommentId:ID$excludeThreadForCommentId:ID){post(slug:$slug){id canManage ...PostPageComments __typename}}fragment PostPageComments on Post{_id id slug name ...on Commentable{_id id canComment __typename}...CommentsSubject ...PostReviewable ...UserSubscribed meta{canonicalUrl __typename}__typename}fragment PostReviewable on Post{id slug name canManage featuredAt createdAt disabledWhenScheduled ...on Reviewable{_id id reviewsCount reviewsRating isHunter isMaker viewerReview{_id id sentiment comment{id body __typename}__typename}...on Commentable{canComment commentsCount __typename}__typename}meta{canonicalUrl __typename}__typename}fragment CommentsSubject on Commentable{_id id ...CommentsListSubject __typename}fragment CommentsListSubject on Commentable{_id id threads(first:$commentsListSubjectThreadsLimit after:$commentsListSubjectThreadsCursor filter:$commentsListSubjectFilter include_comment_id:$includeThreadForCommentId exclude_comment_id:$excludeThreadForCommentId){edges{node{_id id ...CommentThread __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}__typename}fragment CommentThread on Comment{_id id isSticky replies(first:5 after:$commentsThreadRepliesCursor allForCommentId:$includeThreadForCommentId){edges{node{_id id ...Comment __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}...Comment __typename}fragment Comment on Comment{_id id badges body bodyHtml canEdit canReply canDestroy createdAt isHidden path repliesCount subject{_id id ...on Commentable{_id id __typename}__typename}user{_id id headline name firstName username headline ...UserSpotlight __typename}poll{...PollFragment __typename}review{id sentiment __typename}...CommentVote ...FacebookShareButtonFragment __typename}fragment CommentVote on Comment{_id id ...on Votable{_id id hasVoted votesCount __typename}__typename}fragment FacebookShareButtonFragment on Shareable{id url __typename}fragment UserSpotlight on User{_id id headline name username ...UserImage __typename}fragment UserImage on User{_id id name username avatar headline isViewer ...KarmaBadge __typename}fragment KarmaBadge on User{karmaBadge{kind score __typename}__typename}fragment PollFragment on Poll{id answersCount hasAnswered options{id text imageUuid answersCount answersPercent hasAnswered __typename}__typename}fragment UserSubscribed on Subscribable{_id id isSubscribed __typename}"}

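# Send the GraphQL query as a JSON body, the same way the page's JavaScript does.
r = requests.post(url, json=payload)
r.raise_for_status()
# Each edge wraps one top-level comment thread; print its body text.
for item in r.json()['data']['post']['threads']['edges']:
    pprint(item['node']['body'])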

Output at the time of writing:

('Looks like such a powerful tool for extracting performance insights! '
 'Absolutely love the documentation feature, awesome work!')
('This is awesome and so Any discounts or special pricing for '
 'researchers/students/non-professionals?')
'Amazing. I think this is very helpful tools  for us. Keep it up & go ahead.'
('<p>This simple system of record automatically saves logs from every '
 'experiment, making it easy to look over the history of your progress and '
 'compare new models with existing baselines.</p>\n'
 'Pros: <p>Easy, fast, and lightweight experiment tracking</p>\n'
 'Cons: <p>Only available for Python projects</p>')
('Very cool! I hacked together something similar but much more basic for '
 "personal use and always wondered why TensorBoard didn't solve this problem. "
 'I just wish this was open source! :) P.S. awesome use of the parallel '
 'co-ordinates d3.js chart - great idea to apply it to experiment '
 'configurations!')
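
The query above also requests `replies` for every thread and a `pageInfo` block with `endCursor` and `hasNextPage`, so the same response can be walked for replies and paged through. The following is only a sketch, assuming the response mirrors the shape of the query; `iter_bodies`, `fetch_all_threads`, `send` and `max_pages` are hypothetical names introduced for illustration, not part of the original script or of any Product Hunt API.

```python
import copy


def iter_bodies(response_json):
    """Yield (thread_body, [reply_bodies]) for each thread in one response."""
    threads = response_json['data']['post']['threads']
    for edge in threads['edges']:
        node = edge['node']
        # Replies are nested under each thread node, in the same edges/node shape.
        replies = [r['node']['body'] for r in node['replies']['edges']]
        yield node['body'], replies


def fetch_all_threads(send, payload, max_pages=20):
    """Collect threads page by page until hasNextPage is false.

    `send` is any callable that takes a payload dict and returns decoded JSON
    (an assumption made so the paging logic can be tried without the network).
    """
    payload = copy.deepcopy(payload)  # leave the caller's payload untouched
    results = []
    for _ in range(max_pages):  # hard cap so a misbehaving cursor cannot loop forever
        data = send(payload)
        results.extend(iter_bodies(data))
        page_info = data['data']['post']['threads']['pageInfo']
        if not page_info['hasNextPage']:
            break
        # Feed the cursor back into the variable the query declares for it.
        payload['variables']['commentsListSubjectThreadsCursor'] = page_info['endCursor']
    return results
```

With the script above, `send` could be `lambda p: requests.post(url, json=p).json()`. Whether the endpoint still accepts this query is not something the sketch checks.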

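Some elements of the page appear to be loaded dynamically through JavaScript queries.

The requests library lets you send those queries manually and then parse the content of the updated page with bs4.

However, in my experience with dynamic web pages, this approach becomes very tedious when you have many queries to send.

In such cases it is usually better to use a library that integrates real-time browser simulation. That way the simulator itself handles the client-server communication and updates the page; you just wait for the elements to load and can then analyze them safely.

So I suggest you take a look at selenium, or even selenium-requests if you want to keep the "philosophy" of requests.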
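
A minimal sketch of the browser-driven approach, assuming Selenium 4 and a local Chrome installation. `fetch_rendered_html` is a hypothetical helper, and the CSS selector passed to it is an assumption as well: the selector that actually matches the discussion section has to be read off the live page.

```python
def fetch_rendered_html(url, css_selector, timeout=15):
    """Load `url` in headless Chrome and return the HTML once `css_selector` exists."""
    # Imported here so this file can be loaded even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the dynamically added element is in the DOM (or time out).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # page_source now reflects the DOM after JavaScript has run.
        return driver.page_source
    finally:
        driver.quit()
```

The returned string can be passed to `bs4.BeautifulSoup(html, 'html.parser')` exactly like `res.text` in the question; the difference is that it is captured after the page's scripts have run.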

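I haven't looked, but I would bet the discussion is added dynamically via JavaScript and is not in the HTML.

This looks like a really interesting approach. Could you explain a bit more about where you got the payload from? How would I google this approach? Is there a tutorial link you can share?

Check out image 5 in [link] to see where you can find that link.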