How can I get href links from HTML using Python?

python html hyperlink

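import urllib2  # Python 2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
# Read the response body from the object returned by urlopen
html = openwebsite.read()

print html

So far so good.

But I only want the href links from the plain-text HTML. How can I solve this problem?

You can use the HTMLParser module.

The code would probably look something like this: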
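A minimal sketch of the HTMLParser approach, assuming Python 3, where the module lives at html.parser. The names HrefCollector, extract_hrefs and sample are made up for this illustration and are not taken from the original answer. The parser is fed an HTML string and records the href attribute of every a tag it encounters.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values found in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical input; the second anchor has no href and is skipped.
sample = (
    '<p><a href="http://example.com/a">A</a> '
    '<a name="x">no href</a> '
    '<a href="/b?x=1&amp;y=2">B</a></p>'
)
print(extract_hrefs(sample))
```

To use it on a downloaded page, pass the decoded page text (a str, not bytes) to extract_hrefs.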

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt the imports when converting your sources to 3.0.

Try BeautifulSoup:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 only
import urllib2
import re  # used by the http:// filter below

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

If you only need links that start with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
# Naming the parser explicitly avoids the "no parser was specified" warning
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))
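One thing to keep in mind with the ^http:// filter is that it skips https:// links, protocol-relative links and relative paths. The sketch below applies the same pattern to a few made-up href values using only the re module, next to a widened pattern that also accepts https://. The sample list and variable names are assumptions for illustration; passing the widened pattern to findAll in place of the original one should select the corresponding tags.

```python
import re

# Sample href values, made up for illustration.
hrefs = [
    "http://example.com/",
    "https://example.com/login",
    "//cdn.example.com/app.js",
    "/about",
    "mailto:someone@example.com",
]

http_only = re.compile(r"^http://")       # the pattern used in the answer
http_or_https = re.compile(r"^https?://")  # also accepts https://

only_http = [h for h in hrefs if http_only.search(h)]
http_and_https = [h for h in hrefs if http_or_https.search(h)]

print(only_http)
print(http_and_https)
```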

Take a look at the Beautiful Soup HTML parsing library.

You would do something like this:

import BeautifulSoup  # BeautifulSoup 3, Python 2 only

soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")

My answer probably sucks compared to the real gurus out there, but using some simple math, string slicing, find and urllib, this little script will create a list containing link elements. I tested it on Google and the output seems right. Hope it helps!

import urllib  # Python 2

test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
# Assumes every href value is wrapped in double quotes
while sane == 0:
    curpos = test.find("href")
    if curpos >= 0:
        # Drop everything before the next "href"
        test = test[curpos:]
        # Skip past the opening quote of the attribute value
        curpos = test.find('"')
        test = test[curpos + 1:]
        # The value ends at the closing quote
        curpos = test.find('"')
        needle = test[0:curpos]
        # startswith needs a tuple to test several prefixes
        if needle.startswith(("http", "www")):
            needlestack.append(needle)
    else:
        sane = 1
for item in needlestack:
    print item

Here is a lazy version of @stephen's answer:

import html.parser
import itertools
import urllib.request


class LinkParser(html.parser.HTMLParser):
    def reset(self):
        super().reset()
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (name, value) in attrs:
                if name == 'href':
                    # Chain the new link onto the links not yet consumed
                    self.links = itertools.chain(self.links, [value])


def gen_links(stream, parser):
    encoding = stream.headers.get_content_charset() or 'UTF-8'
    for line in stream:
        parser.feed(line.decode(encoding))
        yield from parser.links

Use it like this:

>>> parser = LinkParser()
>>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
>>> links = gen_links(stream, parser)
>>> next(links)
'//stackoverflow.com'

Using BS4 for this specific task seems overkill.

Try instead:

import re
import urllib2  # Python 2

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
# Capture href values that end in .tgz or .tar.gz
files = re.findall(r'href="([^"]*\.(?:tgz|tar\.gz))"', html)
print sorted(files)

I found this nifty piece of code, and it works quite well for me.

I tested it only in my scenario of extracting a list of files from a web folder that exposes its files and folders, and I got a sorted list of the files and folders under the URL.

Using requests with BeautifulSoup and Python 3:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.findAll('a'):
    print(link.get('href'))

This may be quite late to answer, but it works for users of recent Python versions:

from bs4 import BeautifulSoup
import requests

html_page = requests.get('http://www.example.com').text

soup = BeautifulSoup(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))

Don't forget to install the "requests" and "BeautifulSoup" packages, as well as "lxml". Use .text along with get, otherwise it will throw an exception.

"lxml" is passed to remove the warning about which parser to use. You can also use "html.parser", whichever fits your case.

This answer is similar to the others that use requests and BeautifulSoup, but it uses a list comprehension.

Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.findAll("a") and combine it with a list comprehension:

import requests
from bs4 import BeautifulSoup

URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links; href=True skips <a> tags without an href,
# for which link.get("href") would return None
ext_links = [link["href"] for link in soup("a", href=True) if "http" in link["href"]]

The simplest way for me:

from urlextract import URLExtract
from requests import get

url = "http://sample.com/samplepage/"
req = get(url)
text = req.text
# or, if you already have the html source:
# text = "This is html for ex"
text = text.replace("", "").replace("", "")
extractor = URLExtract()
print(extractor.find_urls(text))

Output:

['http://google.com/', 'http://yahoo.com/']

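BeautifulSoup can't automatically close meta tags, for example. The DOM model is invalid, and there is no guarantee that you will find what you are looking for.

Another problem with bsoup is that the format of the link will change from its original. So if you want to change the original link to point to another resource, at the moment I still have no idea how to do this with bsoup. Any suggestions?

Not all links contain http. For example, if you code your site to remove the protocol, the links will start with //. This means just use whatever protocol the site is loaded with (either http: or https:).

A reminder for people who came across this answer recently: BeautifulSoup3 is no longer supported in Python 3. The latest version is BeautifulSoup4, and you can import it with from bs4 import BeautifulSoup.

Thanks! But use link instead of a.

I realized that if a link contains a special HTML character, such as &, it gets converted into its textual representation, such as &amp; in this case. How do you preserve the original string?

I like this solution best, since it doesn't need external dependencies.

@swdev - I realize this is a few years late, but url encoding/decoding is how to handle that.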
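Two of the points raised in the comments, links that start with // and links that still carry &amp;, can be handled with the standard library. The following is only a sketch under assumptions: page_url and raw_hrefs are made-up values, and whether your extraction method hands back escaped text depends on the parser you used, so check before unescaping.

```python
import html
from urllib.parse import urljoin

# Hypothetical page URL and href values, chosen only for illustration.
page_url = "https://stackoverflow.com/questions/3075550"
raw_hrefs = [
    "//stackoverflow.com",               # protocol-relative link
    "/questions",                        # path relative to the site root
    "http://example.com/?a=1&amp;b=2",   # absolute link with an escaped ampersand
]

# urljoin fills in the scheme for "//host" links and resolves
# relative paths against the URL of the page they came from.
absolute = [urljoin(page_url, href) for href in raw_hrefs]

# html.unescape turns character references such as &amp; back into
# the literal character.
cleaned = [html.unescape(link) for link in absolute]

print(cleaned)
```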