Python: Remove part of an href attribute retrieved with bs4

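I need the slugs of all articles on a page. I use bs4 to get the href of every article, but some article links include an extra URL prefix that I do not need. I want to remove that part. I used the following code: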

import requests
import re
from bs4 import BeautifulSoup

r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')
all_slugs = soup.find_all('a', {'class': 'dn br'})

for i in range(len(all_slugs)):
    slug = all_slugs[i]['href']
    print(slug)

Here are the hrefs I get:

/this-is-not-a-real-data-science-degree-d170c660c1cf

/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4

/bitcoin-learning-path-9ed73f2f11d9

/your-first-day-of-school-eaf363b19ded

https://medium.com/free-code-camp/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b

https://medium.com/free-code-camp/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40

https://medium.com/free-code-camp/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0

https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0

/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce

https://medium.com/free-code-camp/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e

What I actually want is this:

/this-is-not-a-real-data-science-degree-d170c660c1cf

/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4

/bitcoin-learning-path-9ed73f2f11d9

/your-first-day-of-school-eaf363b19ded

/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b

/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40

/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0

/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0

/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce

/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e

How can I remove them using a regular expression or some other method?

If the substring to remove is always the same, you can do it without a regular expression, like this:

slug = a['href'].replace('https://medium.com/free-code-camp','')

Example

import requests
from bs4 import BeautifulSoup

r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')

all_slugs = soup.find_all('a', {'class': 'dn br'})

for a in all_slugs:
    slug = a['href'].replace('https://medium.com/free-code-camp','')
    print(slug)

Output

/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
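To check the replace() approach without fetching the page, here is a minimal sketch that applies it to hrefs copied from the question. The helper name strip_prefix and the last URL (a made-up publication) are assumptions for illustration only. It also shows the limitation: a link with any other prefix is left untouched.

```python
PREFIX = 'https://medium.com/free-code-camp'


def strip_prefix(href, prefix=PREFIX):
    """Remove the fixed prefix from href; hrefs without it come back unchanged."""
    return href.replace(prefix, '')


hrefs = [
    '/bitcoin-learning-path-9ed73f2f11d9',
    'https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0',
    # Hypothetical link with a different prefix, not taken from the page
    'https://medium.com/some-other-publication/example-post-123abc',
]

for href in hrefs:
    print(strip_prefix(href))
```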

EDIT: You can also use split()

Example

import requests
from bs4 import BeautifulSoup

r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')

all_slugs = soup.find_all('a', {'class': 'dn br'})

for a in all_slugs:
    # Keep only the last path segment and restore the leading slash
    slug = '/' + a['href'].split('/')[-1]
    print(slug)
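If the hrefs might carry a query string or a trailing slash, a plain split('/') can return the wrong piece. As a rough sketch of one way around that, the standard-library urllib.parse.urlsplit can isolate the path first. The function name slug_from_href and the query-string URL below are assumptions, not something taken from the page.

```python
from urllib.parse import urlsplit


def slug_from_href(href):
    """Return the last path segment of href, with a leading slash."""
    # .path drops the scheme, host, query string and fragment
    path = urlsplit(href).path
    # Ignore a trailing slash, then take whatever follows the last '/'
    last = path.rstrip('/').rsplit('/', 1)[-1]
    return '/' + last


print(slug_from_href('/your-first-day-of-school-eaf363b19ded'))
print(slug_from_href(
    'https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0'
))
# Hypothetical href with a query string, for illustration only
print(slug_from_href('https://medium.com/free-code-camp/example-post-123abc?source=user_profile'))
```

This works for both the relative and the absolute hrefs from the question, because urlsplit treats a relative href as a bare path.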

Actually, the problem is that they are not always constant.

@哈尼: I edited my answer and added split() as a non-regex solution.
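Since the question also asked about regular expressions, here is a minimal sketch of a regex variant that does not depend on a fixed prefix. It assumes the slug is always the last path segment and that the href has no query string or trailing slash; the names LAST_SEGMENT and slug_with_regex and the second URL are made up for illustration.

```python
import re

# Matches the final "/segment" at the end of the string
LAST_SEGMENT = re.compile(r'/[^/]+$')


def slug_with_regex(href):
    """Return the trailing '/segment' of href, or href itself if none is found."""
    match = LAST_SEGMENT.search(href)
    return match.group(0) if match else href


print(slug_with_regex('/bitcoin-learning-path-9ed73f2f11d9'))
# Hypothetical link with a different prefix, not taken from the page
print(slug_with_regex('https://medium.com/some-other-publication/example-post-123abc'))
```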