Finding page hyperlinks in Python without BeautifulSoup


What I am trying to do is find all the hyperlinks of a web page. This is what I have so far, but it does not work:

from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the adress of webpage to find the hyperlinks")
    try:
        webpage = urlopen(address)
        link =  findHyperLinks(webpage)
        print("The hyperlinks are", link)

        webpage.close()
    except Exception as exceptObj:
        print("Error:" , str(exceptObj))

main()

There are multiple problems in your code. One of them is that, by searching for the literal string "<a href>", you are only looking for links whose href attribute is present, empty, and the only attribute on the tag:
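To see why this never matches, here is a minimal illustration (the sample HTML line is made up): a real anchor carries a URL between the quotes, so the literal string "<a href>" does not appear in it.

# A typical anchor as it occurs in real HTML (sample made up):
line = '<p>See <a href="/about.html">the about page</a> for details.</p>'

# The question's test looks for the literal string "<a href>",
# which only an empty, bare href attribute would produce:
print("<a href>" in line)   # False -- so the loop never extracts a link

# What the page actually contains is the tag opener with a quoted value:
print('<a href="' in line)  # True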


Without BeautifulSoup, you can use a regexp and a simple function:

from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    # decode the raw bytes; str() on bytes would yield a "b'...'"
    # literal with escape sequences instead of the real HTML
    res = response.read().decode("utf-8", errors="replace")
    links = re.findall('(?<=<a href=")[^"]*', res)

    for x in links:
        # simply skip in-page bookmarks, like #about
        if x.startswith('#'):
            continue

        # simple handling of root-relative URLs, like /about.html;
        # also be careful with redirects and add more flexible
        # processing, if needed
        if x.startswith('/'):
            x = url + x

        print(x)

find_link('http://cnn.com')
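Note that the url + x concatenation above only covers root-relative links. For more robust resolution, including plain relative paths, the standard library's urllib.parse.urljoin could be used instead; a minimal sketch (the example URLs are arbitrary):

from urllib.parse import urljoin

base = 'http://cnn.com'
print(urljoin(base, '/about.html'))           # http://cnn.com/about.html
print(urljoin(base, 'news/world.html'))       # http://cnn.com/news/world.html
print(urljoin(base, 'http://example.com/x'))  # absolute URLs pass through unchanged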
Comments from the thread:
I am not allowed to use BeautifulSoup in this code.
Open a web browser, navigate to the page, right-click, and view the source; then press Ctrl+F and search for "<a href>". That is one of your problems.
No, I can only use urlopen.
We didn't discuss XPath in class; what about regular expressions?
I'm not sure; the professor gave us a sample guide in class, but instead of finding hyperlinks he showed us how to find different page titles.
@JonathonReinhart, when I did that it didn't show anything. Thank you, this really helped.
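As an aside: if the restriction is only on BeautifulSoup rather than on the standard library as a whole, Python's built-in html.parser can collect href values without any regular expressions. A minimal sketch under that assumption (LinkCollector is a made-up name):

from urllib.request import urlopen
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag the parser sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkCollector()
html = urlopen('http://cnn.com').read().decode('utf-8', errors='replace')
parser.feed(html)
print(parser.links)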