Finding page hyperlinks in Python without BeautifulSoup


What I am trying to do is find all the hyperlinks of a web page. This is what I have so far, but it does not work:

from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the adress of webpage to find the hyperlinks")
    try:
        webpage = urlopen(address)
        link =  findHyperLinks(webpage)
        print("The hyperlinks are", link)

        webpage.close()
    except Exception as exceptObj:
        print("Error:" , str(exceptObj))

main()

There are multiple problems in your code. One of them is that, by searching for the literal string "<a href>", you are only looking for links whose href attribute is present, empty, and the only attribute on the tag:
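To see why this never matches, here is a minimal illustration (the sample HTML line is made up): a real anchor carries a URL between the quotes, so the literal string "<a href>" does not appear in it.

# A typical anchor as it occurs in real HTML (sample made up):
line = '<p>See <a href="/about.html">the about page</a> for details.</p>'

# The question's test looks for the literal string "<a href>",
# which only an empty, bare href attribute would produce:
print("<a href>" in line)   # False -- so the loop never extracts a link

# What the page actually contains is the tag opener with a quoted value:
print('<a href="' in line)  # True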


Without BeautifulSoup, you can use a regexp and a simple function:

from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    # decode the raw bytes; str() on bytes would yield a "b'...'"
    # literal with escape sequences instead of the real HTML
    res = response.read().decode("utf-8", errors="replace")
    links = re.findall('(?<=<a href=")[^"]*', res)

    for x in links:
        # simply skip in-page bookmarks, like #about
        if x.startswith('#'):
            continue

        # simple handling of root-relative URLs, like /about.html;
        # also be careful with redirects and add more flexible
        # processing, if needed
        if x.startswith('/'):
            x = url + x

        print(x)

find_link('http://cnn.com')
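Note that the url + x concatenation above only covers root-relative links. For more robust resolution, including plain relative paths, the standard library's urllib.parse.urljoin could be used instead; a minimal sketch (the example URLs are arbitrary):

from urllib.parse import urljoin

base = 'http://cnn.com'
print(urljoin(base, '/about.html'))           # http://cnn.com/about.html
print(urljoin(base, 'news/world.html'))       # http://cnn.com/news/world.html
print(urljoin(base, 'http://example.com/x'))  # absolute URLs pass through unchanged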
Comments from the thread:
I am not allowed to use BeautifulSoup in this code.
Open a web browser, navigate to the page, right-click, and view the source; then press Ctrl+F and search for "<a href>". That is one of your problems.
No, I can only use urlopen.
We didn't discuss XPath in class; what about regular expressions?
I'm not sure; the professor gave us a sample guide in class, but instead of finding hyperlinks he showed us how to find different page titles.
@JonathonReinhart, when I did that it didn't show anything. Thank you, this really helped.
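As an aside: if the restriction is only on BeautifulSoup rather than on the standard library as a whole, Python's built-in html.parser can collect href values without any regular expressions. A minimal sketch under that assumption (LinkCollector is a made-up name):

from urllib.request import urlopen
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag the parser sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkCollector()
html = urlopen('http://cnn.com').read().decode('utf-8', errors='replace')
parser.feed(html)
print(parser.links)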