Warning: file_get_contents(/data/phpspider/zhask/data//catemap/2/python/303.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python 使用请求模块的web抓取_Python_Json_Web Scraping - Fatal编程技术网

Python 使用请求模块的web抓取

Python 使用请求模块的web抓取,python,json,web-scraping,Python,Json,Web Scraping,我想从一个网站上提取评论,用这段代码我成功地提取评论 import requests from urllib.parse import unquote url = 'https://apicomment.detik.com/graphql' payload = {"query":"query search($type: String!, $size: Int!,$anchor: Int!, $sort: String!, $adsLabelKanal: Strin

我想从一个网站上提取评论,用这段代码我成功地提取评论

import requests
from urllib.parse import unquote

url = 'https://apicomment.detik.com/graphql'
payload = {"query":"query search($type: String!, $size: Int!,$anchor: Int!, $sort: String!, $adsLabelKanal: String, $adsEnv: String, $query: [ElasticSearchAggregation]) {\nsearch(type: $type, size: $size,page: $anchor, sort: $sort,adsLabelKanal: $adsLabelKanal, adsEnv: $adsEnv, query: $query){\npaging sorting counter counterparent profile hits {\nposisi hasAds results {\n id author content like prokontra  status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } }}","variables":{"type":"comment","sort":"newest","size":10,"anchor":1,"query":[{"name":"news.artikel","terms":5307853},{"name":"news.site","terms":"dtk"}],"adsLabelKanal":"detik_finance","adsEnv":"desktop"}}

while True:
    r = requests.post(url,json=payload)
    container = r.json()['data']['search']['hits']['results']
    if not container:
        break
    else:
        for item in container:
            if not len(item['author']):continue
            print(item['author']['name'],unquote(item['content']))

    payload['variables']['anchor']+=1
但是,事实上,我并不真正了解这段代码,尤其是这一行

url = 'https://apicomment.detik.com/graphql'
payload = {"query":"query search($type: String!, $size: Int!,$anchor: Int!, $sort: String!, $adsLabelKanal: String, $adsEnv: String, $query: [ElasticSearchAggregation]) {\nsearch(type: $type, size: $size,page: $anchor, sort: $sort,adsLabelKanal: $adsLabelKanal, adsEnv: $adsEnv, query: $query){\npaging sorting counter counterparent profile hits {\nposisi hasAds results {\n id author content like prokontra  status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } }}","variables":{"type":"comment","sort":"newest","size":10,"anchor":1,"query":[{"name":"news.artikel","terms":5307853},{"name":"news.site","terms":"dtk"}],"adsLabelKanal":"detik_finance","adsEnv":"desktop"}}
url是不同的。但是,输出正是我真正想要的。
有人能给我解释一下并给我一些参考吗?

URL是不同的,因为您提取评论的不是网站本身,而是一个评论api。api提供了一种搜索评论的简单方法,无需对网站进行反向工程

paylod是如何告诉api您在寻找什么。 可能有一些文档说明了您的有效负载必须如何针对这个确切的api进行格式化