How to parse only the links from a webpage in Python?
I am trying to parse only the links out of a webpage. Any help would be appreciated. Below is what I currently get from my parsing attempt.

There are dedicated tools for this, called HTML parsers. Here is an example using BeautifulSoup and requests. It prints:
from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
soup = BeautifulSoup(page.content, 'html.parser')  # explicit parser avoids the bs4 "no parser specified" warning
for link in soup.find_all('a', href=True):
    print(link.get('href'))  # href=True keeps only <a> tags that actually have an href attribute
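Since the example above depends on a live URL, here is the same `find_all` pattern run against an inline snippet instead; the HTML string is illustrative, standing in for `page.content`:

```python
from bs4 import BeautifulSoup

# Illustrative HTML, standing in for the fetched page content
html = '''
<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>
<a name="anchor-without-href">no href here</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# href=True skips <a> tags that lack an href attribute entirely
for link in soup.find_all('a', href=True):
    print(link['href'])
```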
Try this one; see the demo, done with BeautifulSoup.

I appreciate it... really appreciate it.
#my current output#
http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/"
http://www.asecuritysite.com/content/icon_clown.gif" alt="if broken see alex@school.ac.uk +44(0)1314552759" height="100"
http://www.rottentomatoes.com/m/sleeper/"
http://www.rottentomatoes.com/m/sleeper/trailer/"
http://www.rottentomatoes.com/m/star_wars/"
http://www.rottentomatoes.com/m/star_wars/trailer/"
http://www.rottentomatoes.com/m/wargames/"
http://www.rottentomatoes.com/m/wargames/trailer/"
https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php"> SANS to Offer "Hacking Exposed Live"
https://www.sans.org/webcasts/archive/2013"
#I want to get this when I run the module#
http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
http://www.asecuritysite.com/content/icon_clown.gif
http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/star_wars/
http://www.rottentomatoes.com/m/star_wars/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php
https://www.sans.org/webcasts/archive/2013
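If you want to post-process the current (regex-based) output rather than switch to an HTML parser, the stray `"` and trailing attribute text can be cut off at the first double quote. A minimal sketch, where the `raw_links` list is a hypothetical sample of the output shown above:

```python
# Hypothetical sample of the current regex output, including quote debris
raw_links = [
    'http://www.rottentomatoes.com/m/sleeper/"',
    'http://www.asecuritysite.com/content/icon_clown.gif" alt="..." height="100"',
]

# Keep only the part before the first stray double quote
clean_links = [link.split('"', 1)[0] for link in raw_links]
print(clean_links)
```

That said, an HTML parser remains the more robust fix, since it reads the `href` attribute value directly instead of patching up a text match.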
from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
soup = BeautifulSoup(page.content, 'html.parser')  # explicit parser avoids the bs4 "no parser specified" warning
for link in soup.find_all('a', href=True):
    print(link.get('href'))  # href=True keeps only <a> tags that actually have an href attribute
http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
\w+://\w+\.\w+\.\w+[^"]+
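That pattern can be applied with `re.findall`; a sketch over a small illustrative HTML snippet (in practice the page content would come from requests):

```python
import re

# Illustrative HTML snippet standing in for the fetched page
html = '<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>'

# \w+://          scheme, e.g. http://
# \w+\.\w+\.\w+   a three-part hostname like www.rottentomatoes.com
# [^"]+           everything up to, but not including, the closing quote
pattern = r'\w+://\w+\.\w+\.\w+[^"]+'
print(re.findall(pattern, html))
```

The `[^"]+` tail is what stops the match before the attribute's closing quote, which is exactly the trailing `"` problem in the output shown earlier.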
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
>>> soup = BeautifulSoup(page.content, 'html.parser')
>>> for i in soup.select('a[href]'):
...     print(i['href'])
http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...