Python: what to do when the response text doesn't have everything that is displayed in my browser?


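Using Jupyter Notebook (ipynb), I am trying to scrape web content with BeautifulSoup, but the response text doesn't contain everything that is displayed in the browser. I'm trying to extract the article titles and paragraph text, but I can't get the paragraph text because it doesn't appear in the response.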

url = https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

When I go to the URL in my browser, I can see the content I'm looking for:

<div class="article_teaser_body">New evidence suggests salty, shallow ponds once dotted a Martian crater — a sign of the planet's drying climate.</div>

But the result list is empty.

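You don't need Beautiful Soup to scrape this URL:

https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

When I checked the Network tab, I found that the page actually gets the article bodies as JSON from an API request, which is easy to reproduce with the requests library.

You can try the code below.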

import requests
import json # need to use this to pretty print json easily 

headers = {
  'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' # to mimic regular browser user-agent
}

params = (
    ('page', '0'),
    ('per_page', '40'), # you can raise this value (e.g. to 1000) to get more data from a single request
    ('order', 'publish_date desc,created_at desc'),
    ('search', ''),
    ('category', '19,165,184,204'),
    ('blank_scope', 'Latest'),
)

# params is equivalent to page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

r = requests.get('https://mars.nasa.gov/api/v1/news_items/', params=params, headers=headers).json()

print(json.dumps(r, indent=4)) # prints the raw json response

'''
The article data is contained inside the key `"items"`, we can iterate over `"items"` and 
print article title and body. Do check the raw json response to find 
other data included along with article title and body. You just need 
to use the key name to get those values like you see in the below code. 
'''

for article in r["items"]:
    print("Title :", article["title"])
    print("Body :", article["body"])


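This is probably because those elements are rendered by JavaScript, and requests does not render JavaScript. You can try the answer I posted. Please check it, and if it helps, accept it as the answer.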
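A quick way to confirm this kind of problem is to count how many elements with the class you want are present in the raw HTML before any JavaScript has run. This is a rough sketch using only the standard library; `ClassCounter` and `count_class` are hypothetical helper names, and the two HTML strings are made-up stand-ins rather than the real page source.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    """Return how many elements in `html` carry `css_class`."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# Made-up example: what a server might send before scripts run ...
static_html = '<div id="page"></div><script src="app.js"></script>'
# ... and what the browser shows after scripts have filled the page in.
rendered_html = '<div class="article_teaser_body">New evidence suggests ...</div>'

print(count_class(static_html, "article_teaser_body"))
print(count_class(rendered_html, "article_teaser_body"))
```

If the count is zero for `response.text` from requests while the element is visible in the browser's inspector, the content is most likely being added by a script, and looking for the underlying API request in the Network tab is the next step.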
