Python: extracting an HTML link from the links on the following website

python web-scraping

I want to extract the link

/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=

from the HTML of the page

http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
soup = BeautifulSoup(html.text, 'html.parser')
link = soup.find_all('a')
print(link)

I am using Beautiful Soup. How can I do this? Calling find_all('a') does not return the required link in the HTML that comes back.

You just need to use the get method to look up the href attribute:

from bs4 import BeautifulSoup as soup
import requests

url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
page = soup(html.text, 'html.parser')
links = page.find_all('a')
# Print the href of every anchor; get() returns None when the attribute is missing.
for a in links:
    print(a.get('href'))
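
Anchors without an href attribute come back as None from get('href'), and many hrefs on a page are relative. The following is a minimal sketch of one way to handle both, assuming a small inline HTML snippet in place of the live page; sample_html, base_url and absolute_links are names made up for this illustration and are not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<a name="top">no href here</a>
<a href="/stocks/company_info/stock_news.php?sc_id=CHC&amp;pageno=2">2</a>
<a href="http://www.moneycontrol.com/news/">News</a>
"""
# Assumed base URL used to resolve relative links.
base_url = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05"

page = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# and urljoin turns relative paths into absolute URLs.
absolute_links = [
    urljoin(base_url, a["href"]) for a in page.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```

With the real page, html.text from requests.get would take the place of sample_html.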

Try this to get the exact URL you need:

import re

import bs4 as bs
import requests

sauce = requests.get('https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018')
soup = bs.BeautifulSoup(sauce.text, 'html.parser')

# Keep anchors whose href contains "company_info", then narrow to the pagination links.
for a in soup.find_all('a', href=re.compile("company_info")):
    if 'pageno' in a['href']:
        print(a['href'])

Output:

/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=
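
The two checks above (the "company_info" regex and the 'pageno' test) could also be folded into a single compiled pattern passed to find_all. This is only a sketch under assumptions: sample_html is a made-up offline stand-in for the page, and pager_pattern and pager_links are illustrative names.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<a href="/stocks/company_info/stock_news.php?sc_id=CHC&amp;scat=&amp;pageno=2&amp;next=0">2</a>
<a href="/stocks/company_info/stock_news.php?sc_id=CHC&amp;scat=&amp;pageno=3&amp;next=0">3</a>
<a href="/stocks/company_info/stock_quote.php?sc_id=CHC">Quote</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Match only hrefs that point at stock_news.php and carry a pageno parameter.
pager_pattern = re.compile(r"stock_news\.php\?.*\bpageno=\d+")
pager_links = [a["href"] for a in soup.find_all("a", href=pager_pattern)]

print(pager_links)
```

Beautiful Soup searches each href with the compiled pattern, so anchors whose href does not match are skipped before the loop body ever sees them.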

Please share your code.

@balderman This is the edited version. Is it enough?

Thanks again for your help!! I was wondering: would it be sensible to use a for loop and then __contains__('pageno=2'), or is there a better way to find the exact URL I want?

There are many ways to find a matching value in a list: a loop with a contains check, a regex, and so on. I don't know which is best without looking it up, but I think you can use the re package directly in the find_all method to run the regex.

It also returns all of the URLs above, and I only want the specified URL. When I try blink = a['href'] followed by for b in blink: if b.__contains__('pageno') == True: print(b), I get nothing. How can I extract only the URL I want?

Thanks, I also got it by doing something similar. Cheers.
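
One likely reason the loop in the comment above prints nothing: a['href'] is a single string, so iterating over it yields one character at a time, and no single character contains 'pageno'. A minimal sketch of picking out one specific page number, assuming the hrefs have already been collected into a list; has_page_number and wanted are hypothetical names introduced here, not part of the thread.

```python
from urllib.parse import parse_qs, urlsplit

# The two hrefs printed by the answer above.
hrefs = [
    "/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=",
    "/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=",
]


def has_page_number(href, page_number):
    """Return True if the href's query string has pageno equal to page_number."""
    query = parse_qs(urlsplit(href).query)
    return query.get("pageno") == [str(page_number)]


# The check is applied to each whole href, not to its individual characters.
wanted = [href for href in hrefs if has_page_number(href, 2)]
print(wanted)
```

Parsing the query string compares the pageno value exactly, so a substring test such as 'pageno=2' in href matching pageno=20 as well is avoided.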