Retrieve links from a web page using Python and BeautifulSoup

How can I retrieve the links of a web page and copy the URL address of each link using Python?

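Here is a short snippet using urllib2 and BeautifulSoup 3 (Python 2):

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
  # Some <a> tags have no href; .get() avoids a KeyError for those.
  if 'national-park' in a.get('href', ''):
    print 'found a url with national-park in the link'

Here's a short snippet using the SoupStrainer class in BeautifulSoup:
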
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

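The BeautifulSoup documentation is actually quite good and covers a number of typical scenarios.

Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) if you know in advance what you are parsing.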
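
To make the efficiency point concrete, here is a minimal sketch (Python 3, bs4) that parses an in-memory HTML string rather than a live page. The sample markup and the helper name `extract_hrefs` are made up for illustration. With `parse_only=SoupStrainer('a')`, only the `<a>` tags are built into the tree.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup; a real page would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h1>Parks</h1>
  <p>Intro text <a href="/national-parks/yosemite">Yosemite</a></p>
  <a name="anchor-without-href">no link here</a>
  <a href="http://example.com/zion">Zion</a>
</body></html>
"""

def extract_hrefs(html):
    """Return the href of every <a> tag, building only <a> tags into the tree."""
    only_a_tags = SoupStrainer('a')
    soup = BeautifulSoup(html, 'html.parser', parse_only=only_a_tags)
    return [tag['href'] for tag in soup.find_all('a') if tag.has_attr('href')]

hrefs = extract_hrefs(SAMPLE_HTML)
print(hrefs)
```

The `has_attr('href')` check is still needed, since the strainer selects by tag name only and named anchors without an href pass through it.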

Just for getting the links, without BeautifulSoup or regex:

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except ValueError: pass
        else:
            print item[:end]
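
If neither BeautifulSoup nor regular expressions are wanted, the standard library's HTML parser is another option that is less brittle than splitting on `</a>`. The following is a sketch for Python 3 (`html.parser`), run on a made-up HTML string; the class name `LinkCollector` is hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Made-up sample; with a real page, feed the decoded response body instead.
sample = ('<p><a href="/about">About</a> '
          '<A HREF=\'http://example.com/\'>Ex</A> '
          '<a name="x">x</a></p>')
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Unlike the string-splitting approach, this copes with single-quoted attributes, uppercase tags, and extra attributes before `href`.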

Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

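There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or some other environment where anything that isn't pure Python is disallowed.
lxml.html also supports CSS3 selectors, so this sort of thing is trivial.
An example with lxml and XPath looks like this: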
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')

dom =  lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link

Why not use regular expressions:
import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))

Under the hood, BeautifulSoup can now use lxml as its parser. Requests, lxml, and list comprehensions make a killer combo:
import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

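In the list comprehension, the "if '//' in x and 'url.com' not in x" test (here with 'nytimes.com' as the site) is a simple way to scrub the URL list of the site's own "internal" navigation URLs and the like.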
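
The substring test above is quick but can misfire, for example on a URL that merely mentions the site's name in its query string. A slightly stricter sketch compares hostnames with `urllib.parse` (Python 3); the helper name `external_links` and the sample URLs are assumptions for illustration.

```python
from urllib.parse import urlparse

def external_links(hrefs, site_host):
    """Keep absolute links whose host is neither site_host nor a subdomain of it."""
    result = []
    for href in hrefs:
        host = urlparse(href).hostname  # None for relative links like '/about'
        if host is None:
            continue
        if host == site_host or host.endswith('.' + site_host):
            continue
        result.append(href)
    return result

sample_hrefs = [
    '/section/world',
    'https://www.nytimes.com/section/us',
    'https://example.org/?ref=nytimes.com',
    '//cdn.example.net/lib.js',
]
external = external_links(sample_hrefs, 'nytimes.com')
print(external)
```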
The following code retrieves all the links available in a web page using urllib2 and BeautifulSoup 4:
import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)

for line in soup.find_all('a'):
    print(line.get('href'))

For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:
from bs4 import BeautifulSoup
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

Or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

And a version using the requests library which, as written, will work in both Python 2 and Python 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
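
A small sketch of that behavior on an in-memory string (Python 3, bs4 with the built-in `html.parser`); the markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented markup: two real links and one named anchor without an href.
html = ('<a href="/one">1</a>'
        '<a name="top">no href</a>'
        '<a href="http://example.com/two">2</a>')
soup = BeautifulSoup(html, 'html.parser')

all_anchors = soup.find_all('a')             # every <a> element
with_href = soup.find_all('a', href=True)    # only those carrying an href

found = [a['href'] for a in with_href]
print(len(all_anchors), found)
```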
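BeautifulSoup 3 stopped development in March 2012; new projects really should always use BeautifulSoup 4.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can tell BeautifulSoup which character set was found in the HTTP response headers to assist in decoding, but that value can be wrong and can conflict with <meta> header information found in the HTML itself. This is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding(), to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.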
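
The rule of thumb above (trust the HTTP charset only when the Content-Type header actually declares one) can be sketched without any network access. This uses the standard library's `email.message.Message` to parse the header value; the function name `declared_http_charset` is an assumption for illustration, not part of requests or BeautifulSoup.

```python
from email.message import Message

def declared_http_charset(content_type):
    """Return the charset named in a Content-Type header value, or None if absent."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_param('charset')

with_charset = declared_http_charset('text/html; charset=UTF-8')
without_charset = declared_http_charset('text/html')
print(with_charset, without_charset)
```

The result could then be passed as `from_encoding`, staying `None` when the server declared nothing so that BeautifulSoup's own detection takes over.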
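The following snippet prints the href of the first link on its own:

import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

This script does what you're looking for, but also resolves the relative links to absolute links:
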
import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    # Pass the list itself: a generator would be partly consumed by guess_root().
    return resolve_links(get_dom(url).xpath('//a/@href'))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link  

for link in get_links('http://www.google.com'):
    print link

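To find all the links, this example uses the urllib2 module together with the re module.
*One of the most powerful functions in the re module is "re.findall()". While re.search() finds the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.*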
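
The difference between the two functions can be seen on a small in-memory string (Python 3 sketch; the HTML fragment is invented). It also shows a detail worth knowing: when the pattern has more than one capturing group, `re.findall()` returns tuples rather than plain strings.

```python
import re

# Invented HTML fragment; a real page would be read from the network.
html = ('<a href="http://example.com/a">A</a> '
        '<a href="https://example.com/b">B</a>')

# The scheme alternation is non-capturing, so only one group is captured.
pattern = r'"((?:http|ftp)s?://.*?)"'

first = re.search(pattern, html)    # first match only, as a match object
every = re.findall(pattern, html)   # every match, as a list of strings

print(first.group(1))
print(every)

# With two capturing groups, findall yields one tuple per match.
pairs = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(pairs)
```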
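import urllib2
import re

# placeholder address; replace with the page you want to scan
url = "http://www.somewhere.com"

# connect to a URL
website = urllib2.urlopen(url)

# read html code
html = website.read()

# use re.findall to get all the links; the scheme group is non-capturing
# so that findall returns plain strings rather than tuples
links = re.findall('"((?:http|ftp)s?://.*?)"', html)

print links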

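BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below):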
import lxml.html

doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
    print link.attrib['href']

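The code above returns the links as is, and in most cases they will be relative links or links that are absolute from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in relative paths, but so far I haven't needed that. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.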
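
As a quick illustration of how dotted relative paths resolve, here is a Python 3 sketch using `urllib.parse.urljoin` (in Python 2 the same function lives in the `urlparse` module). The base URL and the hrefs are hypothetical.

```python
from urllib.parse import urljoin

base = 'http://example.com/music/albums/index.html'  # hypothetical page URL

hrefs = ['track.mp3', './track.mp3', '../covers/front.jpg', '/robots.txt',
         'http://other.example.org/x']

# Resolve each href against the page it was found on.
resolved = {href: urljoin(base, href) for href in hrefs}

for href, full in resolved.items():
    print(href, '->', full)
```

Joining against the full page URL, rather than a hand-built host prefix, lets `urljoin` handle `./`, `../`, root-relative, and already-absolute links in one call.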
NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't follow redirects, so the version below uses urllib2 + lxml.

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    # optional second argument: glob pattern to filter links, default '*'
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            # comma added between 'https://' and 'ftp://' so both are checked
            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

                    if urltools:
                        href = urlto

import requests
import wget
import os

from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # requires "import urlparse" (Python 2); in Python 3 use urllib.parse.urljoin
            wget.download(full_path)

from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)

# Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))  
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link) 
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)