Python BeautifulSoup: finding subdirectories

I am trying to work out how to find subdirectories on a web page using BeautifulSoup in Python. I have an idea of how I would do it. This is what I have so far:

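from bs4 import BeautifulSoup

html = '''<a href="/images/pic.png">images</a>
<a href="google.com">google</a>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# Keep only <a> tags that actually carry an href attribute
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])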

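The code above prints every link on the page. How can I make it print only subdirectories, such as "/images/pic.png" in the example?

I would prefer to use BeautifulSoup, but any other module is fine.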

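Add an if condition on link['href']. For example, assuming a subdirectory path contains at least two / characters, you can use link['href'].count('/') >= 2 as the condition.
Sample: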
from bs4 import BeautifulSoup
html = '''<a href="/images/pic.png">images</a>
<a href="google.com">google</a>'''

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=True)
for link in links:
    # Treat an href with at least two slashes as a subdirectory path
    if link['href'].count('/') >= 2:
        print(link['href'])

If by "subdirectory" you mean a relative path, you can use link['href'].startswith('/') as the condition instead.
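
One caveat with a plain startswith('/') test is that protocol-relative URLs such as //cdn.example.com/pic.png also begin with a slash but point at another host. Below is a minimal sketch of a stricter check. The helper name is_site_path and the sample hrefs are made up for illustration; in practice the strings would come from soup.find_all('a', href=True).

```python
def is_site_path(href):
    """Return True for root-relative paths such as '/images/pic.png'."""
    # '//host/path' is a protocol-relative URL that points at another host,
    # so it is excluded even though it starts with '/'.
    return href.startswith('/') and not href.startswith('//')


# Hypothetical hrefs standing in for link['href'] values from the page.
hrefs = ['/images/pic.png', 'google.com', '//cdn.example.com/pic.png']
site_paths = [h for h in hrefs if is_site_path(h)]
print(site_paths)
```

Whether protocol-relative links should count depends on what "subdirectory" means for the page being scraped, so treat the extra check as optional.
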
You can parse the URL to extract the directory path:
import posixpath
from urllib.parse import urlparse  # Python 3 home of the old urlparse module
from bs4 import BeautifulSoup

html = '<a href="/images/pic.png">images</a><a href="google.com">google</a>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    # Take only the path component of the URL, then drop the file name
    dirpath = posixpath.dirname(urlparse(a['href']).path)
    if dirpath and dirpath != '/':
        print(dirpath)  # NOTE: urllib.parse.unquote_plus() may introduce `/`
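
If the goal is a list of distinct directories rather than one line per link, the same parsing step can be wrapped in a small function. This is only a sketch under that assumption: directory_of and the sample hrefs are hypothetical names and data, and it uses only the standard library so it can run without a page to parse.

```python
import posixpath
from urllib.parse import urlparse


def directory_of(href):
    """Return the directory part of href's path, or None if it has none."""
    dirpath = posixpath.dirname(urlparse(href).path)
    # '' means a bare file name; '/' means the file sits at the site root.
    return dirpath if dirpath not in ('', '/') else None


# Hypothetical hrefs standing in for a['href'] values from the page.
hrefs = ['/images/pic.png', '/images/logo.png', 'google.com', '/css/site.css']
# A set removes repeats; sorting makes the output order predictable.
directories = sorted({d for d in map(directory_of, hrefs) if d})
print(directories)
```

Because urlparse separates the path from the scheme and host, an absolute URL like http://example.com/a/b.html yields /a rather than being misread through its slashes.
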

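You may want to change the condition to if link['href'].count('/') >= 2 and link['href'].startswith('/'): just to make sure it is actually a relative path.
On its own, .count('/') >= 2 matches some links it should not and misses some it should.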
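
To make the difference between the two conditions concrete, the sketch below runs both on a few sample hrefs. The samples and the function names count_only and count_and_prefix are chosen for illustration and are not taken from the original comments.

```python
def count_only(href):
    """The slash-counting condition on its own."""
    return href.count('/') >= 2


def count_and_prefix(href):
    """Slash counting combined with the leading-slash check."""
    return href.count('/') >= 2 and href.startswith('/')


# Hypothetical hrefs: a root-relative file, an absolute URL, a bare directory.
samples = ['/images/pic.png', 'http://google.com', '/images']
for href in samples:
    print(href, count_only(href), count_and_prefix(href))
```

The absolute URL passes the count-only test because of the two slashes after the scheme, and the combined condition rejects it. A bare directory such as /images has a single slash, so both conditions skip it.
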