Python: Can't find existing element with BeautifulSoup and requests

python html web-scraping

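When I try to extract all the data from the page, I can't find all the elements. Specifically, I'm looking for everything inside the div element with the class search-feature-container (the info box at the top), but when I scrape it, it just says nothing was found. Here is my code: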

import requests
from bs4 import BeautifulSoup

def scrape_britannica(product_name):
    ### SETUP ###
    URL_raw = 'https://www.britannica.com/search?query=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except requests.RequestException:
        print("Could not find URL..")
        return

    ## parse the static HTML returned by the server
    soup = BeautifulSoup(page.content, 'html.parser')

    parent = soup.find("div", {"class": "search-feature-container"})
    print(parent)

scrape_britannica('carl barks')

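I think it has to do with the content not being loaded right away when you open the page, but I still don't know how to fix it. Or maybe it's because the website uses cookies. I'm really looking for any ideas I can get! Thx :D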

I would find all script tags and check which one contains the keyword featuredSearchTopic. Then I would convert its text to JSON (as a dictionary) and access the "description" field.

import requests
from bs4 import BeautifulSoup
import json

def scrape_britannica(product_name):
    ### SETUP ###
    URL_raw = 'https://www.britannica.com/search?query=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except requests.RequestException:
        print("Could not find URL..")
        return

    ## parse the static HTML returned by the server
    soup = BeautifulSoup(page.content, 'html.parser')

    ## the data is embedded as a JavaScript assignment inside a <script> tag
    for parent in soup.find_all("script"):
        if 'featuredSearchTopic' in str(parent):
            ## drop semicolons, keep what follows the last '=', and parse it as JSON
            txt = json.loads(parent.text.replace(';','').split('=')[-1])
            print(txt.get('topicInfo').get('description'))


scrape_britannica('carl barks')

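Result:

comic strip: The institutionalization: …of all the Disney artists, Carl Barks, sole creator of more than 500 of the best Donald Duck and other stories, was rescued from the oblivion to which the Disney policy of anonymity would have consigned him, and became a cult figure. His collected works run to 30 luxurious folio volumes…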
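One caveat with the approach above: removing every semicolon and splitting on '=' can corrupt the payload if the description itself contains ';' or '='. A minimal sketch of a more tolerant variant follows, assuming the script body has the form `var featuredSearchTopic = {...};`. The helper name `parse_js_assignment` and the sample string are hypothetical and only illustrate the idea; inside the loop you would pass `parent.text` instead of the sample.

```python
import json


def parse_js_assignment(script_text, marker="featuredSearchTopic"):
    """Decode the first JSON object that follows `marker` in a script body.

    Returns the decoded object, or None if the marker or a valid object is missing.
    (Hypothetical helper; the real page layout is an assumption.)
    """
    pos = script_text.find(marker)
    if pos == -1:
        return None
    start = script_text.find("{", pos)
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `start` and ignores
        # whatever trails it (for example the closing ';').
        obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    except json.JSONDecodeError:
        return None
    return obj


# Made-up script body containing ';' and '=' inside the data itself.
sample = 'var featuredSearchTopic = {"topicInfo": {"description": "A = B; C"}};'
data = parse_js_assignment(sample)
print(data["topicInfo"]["description"])  # prints: A = B; C
```

Because the decoder stops at the end of the object, nothing inside the string values needs to be stripped or split.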

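You are dealing with a website that runs JavaScript to render its data after the page loads. You can use the following approach to load the script source of the page that contains the part you are looking for. Now you have the parsed tree, including the table of contents (toc), so you can do whatever you want with it.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get("https://www.britannica.com/search?query=world+war+2")
soup = BeautifulSoup(r.text, 'html.parser')
# the 16th text/javascript script tag held the data when this was written
script = soup.find_all(
    "script", {'type': 'text/javascript'})[15].get_text(strip=True)
# keep everything from the first "{" to the last "}"
start = script.find("{")
end = script.rfind("}") + 1
data = script[start:end]
n = json.loads(data)
print(json.dumps(n, indent=4))
# print(n.keys())
# print(n["topicInfo"]["description"])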

Output:
{
    "toc": [
        {
            "id": 1,
            "title": "Introduction",
            "url": "/event/World-War-II"
        },
        {
            "id": 53531,
            "title": "Axis initiative and Allied reaction",
            "url": "/event/World-War-II#ref53531"
        },
        {
            "id": 53563,
            "title": "The Allies\u2019 first decisive successes",
            "url": "/event/World-War-II/The-Allies-first-decisive-successes"
        },
        {
            "id": 53576,
            "title": "The Allied landings in Europe and the defeat of the Axis powers",
            "url": "/event/World-War-II/The-Allied-landings-in-Europe-and-the-defeat-of-the-Axis-powers"
        }
    ],
    "topicInfo": {
        "topicId": 648813,
        "imageId": 74903,
        "imageUrl": "https://cdn.britannica.com/s:300x1000/26/188426-050-2AF26954/Germany-Poland-September-1-1939.jpg",
        "imageAltText": "World War II",
        "title": "World War II",
        "identifier": "1939\u20131945",
        "description": "World War II, conflict that involved virtually every part of the world during the years 1939\u201345. The principal belligerents were the Axis powers\u2014Germany, Italy, and Japan\u2014and the Allies\u2014France, Great Britain, the United States, the Soviet Union, and, to a lesser extent, China. The war was in many...",
        "url": "/event/World-War-II"
    }
}

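Output of print(n.keys()):

dict_keys(['toc', 'topicInfo'])

Output of print(n["topicInfo"]["description"]):

World War II, conflict that involved virtually every part of the world during the years 1939–45. The principal belligerents were the Axis powers—Germany, Italy, and Japan—and the Allies—France, Great Britain, the United States, the Soviet Union, and, to a lesser extent, China. The war was in many...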
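As one example of what can be done with the parsed dictionary, the toc entries in the output carry site-relative URLs. The sketch below turns them into absolute links with the standard library. The function name `toc_links` and the choice of https://www.britannica.com as the base are assumptions for illustration; the sample dictionary is a shortened copy of the output shown above.

```python
from urllib.parse import urljoin

# Assumed base for resolving the relative "url" fields.
BASE_URL = "https://www.britannica.com"


def toc_links(data, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every entry in data["toc"]."""
    return [
        (entry["title"], urljoin(base_url, entry["url"]))
        for entry in data.get("toc", [])
    ]


# Shortened copy of the parsed output, used here instead of a live request.
n = {
    "toc": [
        {"id": 1, "title": "Introduction", "url": "/event/World-War-II"},
        {
            "id": 53531,
            "title": "Axis initiative and Allied reaction",
            "url": "/event/World-War-II#ref53531",
        },
    ],
    "topicInfo": {"title": "World War II", "url": "/event/World-War-II"},
}

links = toc_links(n)
for title, url in links:
    print(title, url)
```

Using `data.get("toc", [])` keeps the helper from failing on result pages that have no table of contents.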
So the div is probably added dynamically by some asynchronous JS code the site is using. I don't think BeautifulSoup can handle this, since it works on static text input. There is also another post you might find interesting.

Can you tell us what your expected output is?
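The point about static input can be checked without a browser: inspect which class names and inline scripts are actually present in the HTML the server returns. The following is a rough sketch using only the standard library's html.parser. The class `StaticInventory`, the sample markup, and the class name search-results are made up for illustration and do not reflect the real page.

```python
from html.parser import HTMLParser


class StaticInventory(HTMLParser):
    """Collect class names and inline script bodies found in static HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only text inside <script> tags is recorded.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical stand-in for a server response: the target div is absent,
# but the data it would display is embedded in a script.
SAMPLE_HTML = """<html><body>
<div class="search-results"></div>
<script type="text/javascript">var featuredSearchTopic = {"topicInfo": {}};</script>
</body></html>"""

inv = StaticInventory()
inv.feed(SAMPLE_HTML)
inv.close()
print("search-feature-container" in inv.classes)  # prints: False
print(any("featuredSearchTopic" in s for s in inv.scripts))  # prints: True
```

If the class is missing from the static markup while the marker shows up in a script, parsing the embedded data is usually enough; otherwise a tool that executes JavaScript would be needed.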