
Trouble finding content when scraping with Python Beautiful Soup

python, beautifulsoup, python-requests, screen-scraping

I'm trying to get the URL of each article title, which is an 'h3' 'a' element. For example, the first result is a link with the text "Functional annotation of a full-length mouse cDNA collection" that links to the corresponding article page.

My search only returns "[]".

My code is as follows:

import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.lens.org/lens/scholar/search/results?q="edith%20cowan"')
soup = BeautifulSoup(req.content, "html5lib")
article_links = soup.select('h3 a')
print(article_links)

Where am I going wrong?

You're running into this problem because you're using the wrong link to get the article links. So I didn't change much and came up with the code below (note that I removed the bs4 module, since it is no longer needed):


I've added some comments to the code below to make it easier to understand.

Thanks, that works, but I don't know why. Can you explain your solution to me?

Hey, I just edited my answer and added some explanation. If it helps, could you click the green check mark next to the answer?

Thanks for the explanation, I think I understand. My plan was to iterate over the pages with the requests/BeautifulSoup approach (which I'm familiar with), but I'm not sure how to do that with this method. Can you suggest the best way to go through each page of results to get the links?

Oh shoot, I forgot to explain one of the key points. My approach works because the response the server sends in your case comes as a JSON string (which, honestly, is quite rare). The .json() I put at the end of the request parses that response into a real Python object. In most cases you can use requests/BeautifulSoup, but not here, because the response to the POST comes as a JSON string (and contains no HTML).
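As a quick way to tell which case you are dealing with, you can inspect the Content-Type header of a response before choosing how to parse it. A minimal sketch (the URL here is only a placeholder, not the real API):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/endpoint")  # placeholder URL

# Servers usually label JSON responses with an application/json Content-Type
if "application/json" in resp.headers.get("Content-Type", ""):
    data = resp.json()  # parse the JSON body into Python dicts/lists
    print(type(data))
else:
    soup = BeautifulSoup(resp.content, "html5lib")  # fall back to HTML parsing
    print(soup.title)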
Here is the code from the original answer:

import requests

search = "edith cowan"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

# This is the payload the site's search page POSTs to its own API
json = {"scholarly_search":{"from":0,"size":"10","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}

# The API answers with JSON, so .json() parses the response directly
req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()

# Build each article URL from the record_lens_id of every hit
links = []
for x in req["query_result"]["hits"]["hits"]:
    links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
Here is the edited version that pages through all of the results:

import requests

search = "edith cowan" # Change this to the term you are searching for
r_to_show = 100 # This is the number of articles per page (I strongly recommend leaving it at 100)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

json = {"scholarly_search":{"from":0,"size":f"{r_to_show}","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}

# Fetch the first page (offset 0)
req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()

links = [] # links are stored here
count = 0

# link_before and link_after help determine when to stop going to the next page
link_before = 0
link_after = 0

while True:
    if count > 0:
        # Advance the offset and fetch the next page; incrementing "from" only
        # here (rather than unconditionally at the top of the loop) avoids
        # skipping the second page
        json["scholarly_search"]["from"] += r_to_show
        req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()
    for x in req["query_result"]["hits"]["hits"]:
        links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
    count += 1
    link_after = len(links)
    if link_after == link_before:
        break # no new links on this pass, so we have reached the last page
    link_before = len(links)
    print(f"page {count} done, links recorded {len(links)}")