Python: href has no attribute 'get' when retrieving the first anchor tag from Wikipedia

python web-scraping

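I am getting an error saying that href has no attribute 'get'. I am trying to retrieve the first anchor tag in this web crawler. I used to extract the href directly with p.a['href'], and printing it with p.a.get('href') works. But when I assign it to href1, it fails with this error:

Traceback (most recent call last):
  File "/Users/asagarwala/IdeaProjects/Py1/new1.py", line 11, in <module>
    print(soup.find(id="mw-content-text").find(class_='mw-parser-output').p.a.get('href'))
AttributeError: 'NoneType' object has no attribute 'get'

Process finished with exit code 1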

Here is my code:

import requests
from bs4 import BeautifulSoup

url1 = "https://en.wikipedia.org/wiki/Anger"

my_list = []
i = 1

while i < 26:
    html = requests.get(url1)
    soup = BeautifulSoup(html.text, 'html.parser')

    print(soup.find(id="mw-content-text").find(class_='mw-parser-output').p.a.get('href'))

    href1 = soup.find(id="mw-content-text").find(class_='mw-parser-output').p.a.get('href')
    url1 = "https://en.wikipedia.org" + href1
    i += 1

    if href1 == 'wiki/Philosophy':
        print("philosophy reached. Bye")
        break

    my_list.append(url1)

print(my_list)

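Your problem is that you only look at the first <p> tag inside the class. On the second iteration (the Emotion page) that paragraph is empty, so p.a is None and calling .get('href') on it raises the AttributeError.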
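To see the failure without hitting the network, here is a minimal sketch on inline HTML. The markup and the "placeholder" class name are made up for illustration; they only mimic a page whose first <p> contains no link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> is empty, the second one holds the link.
html = """
<div class="mw-parser-output">
  <p class="placeholder"></p>
  <p>See <a href="/wiki/Emotion">emotion</a>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(class_="mw-parser-output")

first_p = container.p      # attribute access returns only the first <p>
first_link = first_p.a     # None, because that <p> has no <a> inside it

try:
    first_link.get("href")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(first_link)
print(error_message)
```

The second <p> does contain a usable link, which is why the search has to go beyond the first paragraph.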

Try the following:

In [176]: import bs4
     ...: import requests
     ...:
     ...: def wiki_travel(url):
     ...:     visited = []
     ...:     for i in range(26):
     ...:         html = requests.get(url)
     ...:         # Stop on any non-2xx response
     ...:         if not html.ok:
     ...:             print("'{0}' got response code {1}".format(url, html.status_code))
     ...:             break
     ...:
     ...:         soup = bs4.BeautifulSoup(html.text, 'html.parser')
     ...:
     ...:         # First href starting with '/' in any <p> of the content block, or None
     ...:         target = next((c.get('href') for p in soup.find(class_='mw-parser-output').findAll('p') for c in p.findAll('a') if c.get('href', '').startswith('/')), None)
     ...:         if not target:
     ...:             print('Target not found')
     ...:             break
     ...:
     ...:         print(target)
     ...:         url = 'https://en.wikipedia.org' + target
     ...:         if target == '/wiki/Philosophy':
     ...:             print('Philosophy reached. Bye')
     ...:             break
     ...:
     ...:         visited.append(url)
     ...:
     ...:     return visited
     ...:

Testing this:

In [177]: wiki_travel('https://en.wikipedia.org/wiki/Anger')
/wiki/Emotion
/wiki/Consciousness
/wiki/Quality_(philosophy)
/wiki/Philosophy
Philosophy reached. Bye
Out[177]:
['https://en.wikipedia.org/wiki/Emotion',
 'https://en.wikipedia.org/wiki/Consciousness',
 'https://en.wikipedia.org/wiki/Quality_(philosophy)']

The key is the following line:

target = next((c.get('href') for p in soup.find(class_='mw-parser-output').findAll('p') for c in p.findAll('a') if c.get('href', '').startswith('/')), None)

What is going on here? This is a generator expression, roughly equivalent to:

target = []
# Search for all p tags within this class
for p in soup.find(class_='mw-parser-output').findAll('p'):
    # Find all a tags
    for c in p.findAll('a'):
        # Only add to target list iff the link starts with a '/'
        # I.e. no anchors ('#') which won't get us to a new page
        if c.get('href', '').startswith('/'):
            target.append(c.get('href'))

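and then taking target[0], or None if no result was found. Unlike the list version, next() stops at the first match instead of collecting every link.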
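The next(generator, default) idiom can be tried on its own with plain strings. This is a small sketch; the helper name first_internal and the sample hrefs are made up for illustration.

```python
def first_internal(hrefs):
    """Return the first href that starts with '/', or None if there is none."""
    # next() pulls a single item from the generator and falls back to the
    # default (None) when the generator is exhausted without a match.
    return next((h for h in hrefs if h.startswith("/")), None)


sample = ["#cite_note-1", "/wiki/Emotion", "/wiki/Mind"]
print(first_internal(sample))
print(first_internal(["#top", "#history"]))
```

Without the second argument, next() would raise StopIteration when nothing matches, so the default is what makes the later "if not target" check possible.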

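Your lookup found nothing, so there was nothing to print. Before using them, you must check that both soup.find(id="mw-content-text") and soup.find(id="mw-content-text").find(class_='mw-parser-output') actually exist. The request returns 200, but the first paragraph on the Emotion page is empty and has no <a> child. I guess you are doing the "21 steps to Philosophy" thing. What strategy are you using to get there?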
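The existence checks described above could look roughly like this. It is only a sketch: the helper name first_paragraph_href and the inline test pages are hypothetical, and it still inspects just the first <p>, as the question's code does.

```python
from bs4 import BeautifulSoup


def first_paragraph_href(soup):
    """Return the href of the first <a> in the first <p>, or None if any step is missing."""
    content = soup.find(id="mw-content-text")
    if content is None:
        return None
    output = content.find(class_="mw-parser-output")
    if output is None:
        return None
    paragraph = output.p
    if paragraph is None or paragraph.a is None:
        return None
    return paragraph.a.get("href")


# Hypothetical pages standing in for real Wikipedia responses.
page_with_link = (
    '<div id="mw-content-text"><div class="mw-parser-output">'
    '<p><a href="/wiki/Emotion">emotion</a></p></div></div>'
)
page_empty_first_p = (
    '<div id="mw-content-text"><div class="mw-parser-output">'
    '<p></p><p><a href="/wiki/Mind">mind</a></p></div></div>'
)
page_without_container = "<p>nothing here</p>"

results = [
    first_paragraph_href(BeautifulSoup(page, "html.parser"))
    for page in (page_with_link, page_empty_first_p, page_without_container)
]
print(results)
```

Returning None instead of raising lets the calling loop decide whether to stop or to fall back to scanning later paragraphs.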