Python: searching a large string for a file path and returning the path + filename


python regex string


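I have a small project in which I want to download a series of wallpapers from a web page. I'm new to Python.

I'm using the urllib library, which returns a long string of page data that includes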

<a href="http://website.com/wallpaper/filename.jpg">

How can I search the page source for this part of the text and return the rest of the image link, ending with the "*.jpg" extension?

r'http://website.com/wallpaper/ xxxxxx .jpg'

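I was wondering whether I could write a regular expression in which the xxxx part is not evaluated, so that it only checks the path and the .jpg extension, and then returns the whole string once a match is found.

Am I on the right track?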

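I think a very basic regular expression would do the job. With a single capture group around the whole pattern, $1 will return the entire string.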

If you use

(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)

then $1 will give the whole string and $2 will give only the filename.

Note: the escaping (\/) is language-dependent, so use whatever Python supports.
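As a rough sketch of how the two-group pattern above might look in Python: the forward slashes need no escaping there, and the groups are read with `match.group(1)` and `match.group(2)` rather than `$1` and `$2`. The sample page source and the name `find_wallpapers` are made up for illustration, and `[\w-]` is used as a shorter equivalent of `[\w\d_-]`.

```python
import re

# Hypothetical page source standing in for the string returned by urllib.
page_source = '''
<a href="http://website.com/wallpaper/sunset_01.jpg">Sunset</a>
<a href="http://website.com/about.html">About</a>
<a href="http://website.com/wallpaper/forest-2.jpg">Forest</a>
'''

# Group 1 captures the full URL, group 2 only the filename without extension.
WALLPAPER_RE = re.compile(r'(http://website\.com/wallpaper/([\w-]*?)\.jpg)')


def find_wallpapers(text):
    """Return a list of (full_url, filename) tuples found in text."""
    return [(m.group(1), m.group(2)) for m in WALLPAPER_RE.finditer(text)]


for full_url, filename in find_wallpapers(page_source):
    print(full_url, filename)
```

This only works when filenames consist of letters, digits, underscores and hyphens; a filename with spaces or dots would need a wider character class.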

BeautifulSoup is very handy for this kind of thing.

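import re
import urllib3
from bs4 import BeautifulSoup

# Raw strings keep the backslashes intact for the regex engine.
jpg_regex = re.compile(r'\.jpg$')
site_regex = re.compile(r'website\.com/wallpaper/')

pool = urllib3.PoolManager()
response = pool.request('GET', 'http://your_website.com/')
# response.data holds the raw body; name the parser explicitly.
soup = BeautifulSoup(response.data, 'html.parser')

# Anchors whose href ends in .jpg (find_all accepts a compiled regex as a filter).
jpg_list = soup.find_all(name='a', attrs={'href': jpg_regex})

# Keep only the hrefs that also point into the wallpaper directory.
result_list = [a.get('href') for a in jpg_list if site_regex.search(a.get('href'))]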

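Don't use regular expressions on HTML.

Instead, use an HTML parsing library.

BeautifulSoup is a library for parsing HTML, and urllib2 is a built-in module for fetching URLs (urllib.request in Python 3).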

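from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup as bs

content = urlopen('http://website.com/wallpaper/index.html').read()
html = bs(content, 'html.parser')  # name the parser explicitly
links = []  # collected wallpaper URLs

for link in html.find_all('a'):
    href = link.get('href')
    # Skip anchors without an href (link.get returns None for them).
    if href and '/wallpaper/' in href:
        links.append(href)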
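If installing BeautifulSoup is not an option, roughly the same link extraction can be sketched with the standard library's `html.parser`. This is a substitute technique, not what the answer above uses. The class name `WallpaperLinkParser`, the helper `extract_wallpaper_links` and the sample HTML are all hypothetical, and the sketch additionally checks for the `.jpg` extension the question asked about.

```python
from html.parser import HTMLParser


class WallpaperLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .jpg files under /wallpaper/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; href is None if absent.
        href = dict(attrs).get('href')
        if href and '/wallpaper/' in href and href.lower().endswith('.jpg'):
            self.links.append(href)


def extract_wallpaper_links(html_text):
    """Parse html_text and return the matching wallpaper URLs in document order."""
    parser = WallpaperLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical page source for illustration only.
sample_html = '''
<a href="http://website.com/wallpaper/filename.jpg">one</a>
<a name="top">no href here</a>
<a href="http://website.com/wallpaper/index.html">index</a>
<a href="http://website.com/other/photo.jpg">other</a>
'''

print(extract_wallpaper_links(sample_html))
```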

Search the URL for the "http://website.com/wallpaper/" substring, then check the URL for ".jpg", like this:

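domain = "http://website.com/wallpaper/"
url = "your URL"
extension = ".jpg"  # named `extension` so the built-in `format` is not shadowed
# Both conditions must hold: the wallpaper path is present and the URL ends in .jpg.
if domain in url and url.endswith(extension):
    pass  # do something with the matching URL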
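Applied to more than one URL, the same substring check could be wrapped in a small helper and used to filter a list. This is only a sketch: the function name `is_wallpaper_jpg` and the candidate URLs are invented for illustration.

```python
def is_wallpaper_jpg(url, domain="http://website.com/wallpaper/", extension=".jpg"):
    """Return True if url contains the wallpaper path and ends with the extension."""
    return domain in url and url.lower().endswith(extension)


# Hypothetical URLs, e.g. hrefs already pulled out of the page.
candidates = [
    "http://website.com/wallpaper/filename.jpg",
    "http://website.com/wallpaper/index.html",
    "http://website.com/other/photo.jpg",
]

wallpapers = [url for url in candidates if is_wallpaper_jpg(url)]
print(wallpapers)
```

Plain string checks like this avoid regular expressions entirely, but they assume the URLs have already been extracted from the HTML.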

You can use regex, but you shouldn't. Maybe BeautifulSoup?

Thanks for the replies. Two votes for BeautifulSoup, so I guess I'd better take a look at it. I came across it in my google-fu but had no idea how to use it.
