Finding a page's hyperlinks in Python without BeautifulSoup
What I want to do is find all the hyperlinks on a web page. This is what I have so far, but it does not work:
from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the address of the webpage to find the hyperlinks: ")
    try:
        webpage = urlopen(address)
        link = findHyperLinks(webpage)
        print("The hyperlinks are", link)
        webpage.close()
    except Exception as exceptObj:
        print("Error:", str(exceptObj))

main()
def main():
address = input("Please enter the adress of webpage to find the hyperlinks")
try:
webpage = urlopen(address)
link = findHyperLinks(webpage)
print("The hyperlinks are", link)
webpage.close()
except Exception as exceptObj:
print("Error:" , str(exceptObj))
main()
There are multiple problems in your code. One of them is that you are searching for links whose href attribute is present, empty, and the tag's only attribute: <a href>. Without BeautifulSoup, you can use a regular expression and a simple function:
from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    res = str(response.read())
    my_dict = re.findall('(?<=<a href=")[^"]*', res)
    for x in my_dict:
        # simply skip page bookmarks, like #about
        if x[0] == '#':
            continue
        # simple handling of relative URLs, like /about.html;
        # also be careful with redirects and add more flexible
        # processing, if needed
        if x[0] == '/':
            x = url + x
        print(x)

find_link('http://cnn.com')
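One caveat about the `x = url + x` line above: plain concatenation only gives correct results when the base URL has no path or trailing slash. As a sketch, the standard library's urllib.parse.urljoin resolves relative hrefs against the page URL properly (the sample URLs below are just illustrative):

```python
from urllib.parse import urljoin

# urljoin resolves a relative href against the page's URL,
# handling root-relative paths, sibling paths, and absolute links.
base = "http://cnn.com/world/index.html"
print(urljoin(base, "/about.html"))          # -> http://cnn.com/about.html
print(urljoin(base, "news.html"))            # -> http://cnn.com/world/news.html
print(urljoin(base, "http://example.com/x")) # absolute URLs pass through unchanged
```

Inside find_link, replacing the `if x[0] == '/'` branch with `x = urljoin(url, x)` would cover all three cases at once.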
Comments: I'm not allowed to use BeautifulSoup in this code. — Open a web browser, navigate to a page, right-click, and view the source. Then press Ctrl+F and search for <a href>. That is one of your problems. — No, I can only use urlopen; we didn't cover XPath in class. — What about regular expressions? — I'm not sure; the professor gave us a sample guide in class, but instead of finding hyperlinks it showed how to find a page title. @JonathonReinhart, when I tried that, nothing was printed. Thanks, this really helped.
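Since BeautifulSoup is off limits but the standard library presumably is not, another option besides the regex is html.parser, which copes with attribute order, extra attributes, and single-quoted values that the lookbehind pattern misses. A minimal sketch (the LinkParser class name and the sample HTML are my own illustration):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkParser()
parser.feed('<p><a href="/about.html">About</a> <a href="#top">Top</a></p>')
print(parser.links)  # -> ['/about.html', '#top']
```

With a real page you would feed it the decoded body, e.g. `parser.feed(urlopen(url).read().decode("utf-8"))`, and then filter or join the collected links as in the answer above.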