How to get a URL from a string in Python
I use curl to fetch a web page and store it in a variable in Python:
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
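I tried to match the URL with a regular expression that begins with "(https|http)" and ends at the closing double quote, but I get an empty result:
>>> []
Please help me with this. Thanks in advance.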
Use re.search:
import re
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
m = re.search("src=\"(?P<url>.*?)\"", var)
if m:
    # The named group 'url' holds everything between the quotes after src=
    print(m.group('url'))
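@Manoj, you can also retrieve the value of the src attribute by calling the split() method several times, as shown below.
» Using a lambda function (one-line statement)
Let us expand the approach above into multiple statements to make the actual process clear.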
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
print(var, "\n")
# <body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">
parts1 = var.split("=")
print(parts1, "\n")
# ['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']
parts2 = parts1[1].split('\"')
print(parts2, "\n")
# ['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']
print(parts2[1])
# https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
» Output
E:\Users\Rishikesh\Python3\Practice\Temp>python GetUrls.py
<body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">
['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']
['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']
https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
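The split chain walked through above can be condensed into a single expression. The following is only a sketch of that idea, not the answer's original one-liner; the function name first_attr_value is hypothetical, and it assumes the first "=" in the string belongs to the attribute you want and that its value is double-quoted.

```python
var = '<body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">'


def first_attr_value(html):
    """Return the first double-quoted value that follows an '=' sign, or None.

    Hypothetical helper: it relies purely on string positions, so it breaks
    as soon as another attribute appears before the one you are after.
    """
    try:
        # Cut at the first '=', then take the text between the first pair of quotes.
        return html.split("=", 1)[1].split('"')[1]
    except IndexError:
        # No '=' or no quoted value was found.
        return None


print(first_attr_value(var))
```

Because this depends on the exact layout of the markup, it is best kept for quick one-off strings rather than for arbitrary pages fetched with curl.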
With BeautifulSoup, you can search for a or img tags and check their attributes. For example:
from bs4 import BeautifulSoup as soup
var = '<body><a href=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\"><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/></a><div class=\"wrapper\">'
page_soup = soup(var, "html.parser")
links = []
# Visit every <a> and <img> tag and collect whichever URL attribute it carries
for elm in page_soup.find_all(['a', 'img']):
    if elm.has_attr('href'):
        links.append(elm.get('href'))
    if elm.has_attr('src'):
        links.append(elm.get('src'))
print(links)
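If installing bs4 is not an option, roughly the same collection can be done with the standard library's html.parser module. This is a minimal sketch under that assumption; the class name LinkCollector and the function collect_links are hypothetical, not part of any answer above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href/src values from <a> and <img> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag not in ("a", "img"):
            return
        for name, value in attrs:
            if name in ("href", "src") and value is not None:
                self.links.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


var = '<body><a href="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg"><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/></a><div class="wrapper">'
print(collect_links(var))
```

Like the BeautifulSoup version, this tolerates the unclosed div in the sample string, since the parser only reacts to the start tags it sees.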
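Note: the string will not contain just one URL; curl may fetch a page that contains several URLs.
Maybe the requests and/or BeautifulSoup modules are something for you. They can easily do what you are asking for.
Do you need all links? Even links to stylesheets and JavaScript, external links (whether they begin with // or http(s)://), and internal links (absolute like /path/to and relative like path/to) alike?
Thanks, but the beginning may not be "src=\""; it could be "href=\"", and the pattern would not work in that case, right?
In that case, use m = re.search("(src|href)=\"(?P<url>.*?)\"", var)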
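Since the comments point out that a fetched page usually holds several URLs, some absolute and some relative, one way to handle that with regular expressions is re.findall plus urllib.parse.urljoin. This is a sketch only: the sample html string, the example.com addresses, and the base value are made up for illustration, and the pattern assumes double-quoted attribute values.

```python
import re
from urllib.parse import urljoin

# Hypothetical sample markup with one absolute and two relative URLs.
html = ('<a href="https://example.com/news/item.html">'
        '<img src="/static/logo.png"/></a>'
        '<link href="style.css">')

# Non-capturing group for the attribute name, so findall returns only the value.
ATTR_URL = re.compile(r'(?:src|href)\s*=\s*"(.*?)"')

urls = ATTR_URL.findall(html)
print(urls)
# ['https://example.com/news/item.html', '/static/logo.png', 'style.css']

# Assumed address the page was fetched from; relative links resolve against it.
base = "https://example.com/news/"
absolute = [urljoin(base, u) for u in urls]
print(absolute)
# ['https://example.com/news/item.html', 'https://example.com/static/logo.png', 'https://example.com/news/style.css']
```

The pattern will also pick up attributes whose names merely end in src or href (for example data-src), and it ignores single-quoted or unquoted values, so an HTML parser remains the more robust choice for real pages.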