How can I find every link as a string in an HTML page with Beautiful Soup? (findAll does not work for this site)

python web-scraping

I want to collect the names of all Dota 2 heroes into a list; on the page they appear as links.

Here is my test code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'https://dota2.gamepedia.com/Abaddon/Counters'

# Download the page and parse it
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# Select the main content div of the wiki page
containers = page_soup.findAll("div", {"class": "mw-parser-output"})

print(containers)

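However, when I print the containers variable, almost all of the information under this div tag is missing; only a few comments are included.

I don't know why this happens. I have an idea for scraping the links after this step, but first I need to get the whole content into containers.

This script prints all of the names from that page: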

import requests
from bs4 import BeautifulSoup

url = 'https://dota2.gamepedia.com/Abaddon/Counters'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_heros = [a.text for a in soup.select('b > a')]

# print them, one per line:
print(*all_heros, sep='\n')

Prints:

Ancient Apparition
Axe
Brewmaster
Doom
Lina
Lion
Mars
Outworld Devourer
Silencer
Shadow Demon
Death Prophet
Mirana
Bane
Batrider
Beastmaster
Chen
Techies
Bloodseeker
Necrophos
Nyx Assassin
Storm Spirit
Phantom Assassin
Io
Axe
Legion Commander
Centaur Warrunner
Oracle


Edit (to scrape the categories as well, you can use the .find_previous() function):
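A minimal sketch of that idea, not necessarily the exact code from the answer. It assumes each category title is an `<h2>` heading that precedes its hero links; the inline `SAMPLE_HTML` and the name `categories` are hypothetical and only stand in for the real page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the assumed page layout:
# an <h2> category heading followed by hero links wrapped in <b>.
SAMPLE_HTML = """
<div class="mw-parser-output">
  <h2>Bad against...</h2>
  <p><b><a href="/Axe">Axe</a></b>: <a href="/Culling_Blade">Culling Blade</a></p>
  <p><b><a href="/Doom">Doom</a></b></p>
  <h2>Good against...</h2>
  <p><b><a href="/Mirana">Mirana</a></b></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

categories = {}
for link in soup.select("b > a"):
    # Walk backwards in the document to the closest preceding <h2>
    heading = link.find_previous("h2")
    key = heading.get_text(strip=True) if heading else "(no heading)"
    categories.setdefault(key, []).append(link.get_text(strip=True))

print(categories)
```

With the soup built from the URL in the script above in place of the sample, the same loop groups the names under the page's section headings.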

Prints:

{'Bad against...': ['Ancient Apparition',
                    'Axe',
                    'Brewmaster',
                    'Doom',
                    'Lina',
                    'Lion',
                    'Mars',
                    'Outworld Devourer',
                    'Silencer',
                    'Shadow Demon'],
 'Good against...': ['Death Prophet',
                     'Mirana',
                     'Bane',
                     'Batrider',
                     'Beastmaster',
                     'Chen',
                     'Techies',
                     'Bloodseeker',
                     'Necrophos',
                     'Nyx Assassin'],
 'Works well with...': ['Storm Spirit',
                        'Phantom Assassin',
                        'Io',
                        'Axe',
                        'Legion Commander',
                        'Centaur Warrunner',
                        'Oracle']}

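What exactly do you want to extract? Include the expected output in your post.

@Sushant, I want to add all of the information below it, but it can't do that.

Thanks a lot. Can you describe your code? Can it be written with findAll? The names are also split into 3 different categories; how can I scrape those as well?

@babakabdzadeh I updated my answer. It can be done with .find_all(), but that needs a lambda function or more lines of code. The CSS selector b > a means: find every <a> tag that sits directly under a <b> tag.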
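For comparison, here is a hedged sketch of the `.find_all()` variant mentioned in that comment. It passes a lambda that keeps only `<a>` tags whose direct parent is a `<b>` tag, which should match the same elements as `b > a`. The inline `SAMPLE_HTML` and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link directly inside <b>, one plain link
SAMPLE_HTML = """
<p><b><a href="/Axe">Axe</a></b> uses <a href="/Culling_Blade">Culling Blade</a></p>
<p><b><a href="/Mirana">Mirana</a></b></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS child combinator: <a> tags that are direct children of <b>
via_select = [a.text for a in soup.select("b > a")]

# The same filter written as a function passed to find_all()
via_find_all = [
    a.text
    for a in soup.find_all(
        lambda tag: tag.name == "a"
        and tag.parent is not None
        and tag.parent.name == "b"
    )
]

print(via_select)
print(via_find_all)
```

Both lists contain only the links wrapped in `<b>`; the plain "Culling Blade" link is skipped in each case.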