
How to extract URLs from an HTML page in Python


I have to write a web crawler in Python. I don't know how to parse a page and extract URLs from the HTML. Where should I go to learn to write such a program?


In other words, is there a simple Python program that could be used as a template for a generic web crawler? Ideally it should use relatively simple modules and include plenty of comments describing what each line of code is doing.

Have a look at the sample code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links in that page. Hope this helps.

#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# re-serialize the HTML so every link is consistently quoted: a href="..."
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    :param page: html of a web page (here: the Python home page)
    :return: the first url found in that page, and the offset to resume scanning from
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    # the URL is the text between the pair of quotes following 'a href'
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote


while True:
    url, n = getURL(page)
    page = page[n:]  # drop everything up to the link just found
    if url:
        print(url)
    else:
        break
Output:

/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org
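
Note that many of the extracted links are relative paths such as /about/. If you need absolute URLs, for example to feed them back into a crawler, you can resolve them against the page's address with urljoin from the standard library. A minimal sketch (Python 3 module path shown; in Python 2 the same function lives in urlparse):

from urllib.parse import urljoin

base = "http://www.python.org"
print(urljoin(base, "/about/"))            # http://www.python.org/about/
print(urljoin(base, "http://pycon.org/"))  # absolute URLs pass through unchanged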

To parse the page, have a look at a dedicated parsing module such as BeautifulSoup (used in the answers below); it is simple to use and lets you treat the page as structured HTML rather than hunting through it with raw string searches like

str.find('a')
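
If you would rather stay in the standard library, Python's built-in HTML parser can do the same job. A minimal sketch, assuming Python 3 (the module is named HTMLParser in Python 2):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="/about/">About</a> <a href="/news/">News</a>')
print(collector.links)  # ['/about/', '/news/']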

You can use BeautifulSoup. Work through its documentation and see what matches your requirements; it also contains code snippets showing exactly how to extract URLs.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

soup.find_all('a')  # Finds every <a> tag in the html doc
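
Keep in mind that find_all('a') returns Tag objects rather than the URLs themselves, so the href attribute still has to be read off each tag. For example, with a small stand-in document (the html_doc content here is just an illustration):

from bs4 import BeautifulSoup

html_doc = '<a href="/doc/">Docs</a> <a name="anchor">no href here</a>'
soup = BeautifulSoup(html_doc, "html.parser")
urls = [tag.get('href') for tag in soup.find_all('a') if tag.get('href')]
print(urls)  # ['/doc/']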
import sys
import re
import urllib2
import urlparse

tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile('...')  # the regular expression was truncated in the original
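
The fragment above only gets as far as setting up its state: a set of pages still to crawl and a set of pages already visited. A minimal sketch of the crawl loop it appears to be building toward, reusing requests and BeautifulSoup from the earlier answer instead of urllib2 (the facebook.com seed comes from the fragment; the page limit is just there so the sketch terminates):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

tocrawl = set(["http://www.facebook.com/"])
crawled = set()

while tocrawl and len(crawled) < 10:
    url = tocrawl.pop()
    crawled.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue  # skip pages that fail to load
    print(url)
    for tag in BeautifulSoup(html, "html.parser").find_all("a"):
        href = tag.get("href")
        if href:
            link = urljoin(url, href)  # resolve relative links against the current page
            if link not in crawled:
                tocrawl.add(link)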

You can use BeautifulSoup, as described earlier. It can parse HTML, XML, and more; see its documentation for some of its features. For example:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'

conn = urllib2.urlopen(url)   # fetch the page (urllib2 is Python 2 only)
html = conn.read()

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')    # every <a> tag in the page

for tag in links:
    link = tag.get('href', None)
    if link is not None:      # skip <a> tags that carry no href
        print link
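
urllib2 only exists on Python 2. A sketch of the same example on Python 3, where urllib.request replaces urllib2 and print is a function:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('a'):
    link = tag.get('href')
    if link is not None:
        print(link)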