Python/BeautifulSoup: path to the current element in HTML


For a class project, I am extracting all the links on a web page. This is what I have so far:

from bs4 import BeautifulSoup, SoupStrainer

with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)

outputFile = open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link) + '\n')
outputFile.close()

This works.

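The complication: for each <a> element, my project requires me to know the entire "tree structure" of the current link. In other words, I want to know all the enclosing elements, starting from the <body> element, along with each one's class and id.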

This is similar to the navigation bar in Windows Explorer, or to the navigation panel in many browsers' element inspection tools.

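For example, if you look at the Wikipedia page for the Bible and the link there to the Wikipedia page for the Talmud, the following "path" is what I am looking for: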

<body class="mediawiki ...>
 <div id="content" class="mw-body" role="main">
  <div id="bodyContent" class="mw-body-content">
   <div id="mw-content-text" ...>
    <div class="mw-parser-output">
     <div role="navigation" ...>
      <table class="nowraplinks ...>
       <tbody>
        <td class="navbox-list ...>
         <div style="padding:0em 0.25em">
          <ul>
           <li>
            <a href="/wiki/Talmud"


Try the following code:
soup = BeautifulSoup(inputFile, 'html.parser')

Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')

If it is not installed:
pip install lxml
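
The two parser options above can be combined into a fallback. This is only a minimal sketch: the helper name make_soup and the sample markup are made up for illustration, and it assumes that bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


# Hypothetical sample markup, used only to show the helper in action
soup = make_soup('<p><a href="/wiki/Talmud">Talmud</a></p>')
print(soup.find("a")["href"])
```

Either parser handles this small sample the same way; on malformed real-world pages the two can build slightly different trees.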


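Here is a solution I just wrote. It works by finding the element and then navigating up the tree through the element's parents. I parse only the opening tag of each element and add it to a list, then reverse the list. In the end we get a list that resembles the tree you asked for.
I wrote it for a single element; you can modify it to work with find_all.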
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')

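tree = []

# Start from the first <a> that has an href and keep only its opening tag
hrefElement = soup.find('a', href=True)
hrefString = str(hrefElement).split(">")[0] + ">"
tree.append(hrefString)

# Walk up through the parents until <html> is reached
hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()

# Reverse so the list runs from the outermost element down to the link
tree.reverse()
print(tree)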
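
One way the same idea might be extended to every link via find_all is sketched below. It is not the code from the answer above: the helpers describe and element_path and the sample HTML are hypothetical, and each element is written as name#id.class instead of its raw opening tag. It relies on the .parents iterator and on the class attribute being returned as a list.

```python
from bs4 import BeautifulSoup


def describe(tag):
    """Return a short label such as div#content.mw-body for one tag."""
    label = tag.name
    if tag.get("id"):
        label += "#" + tag["id"]
    # class is a multi-valued attribute, so bs4 returns it as a list
    for cls in tag.get("class", []):
        label += "." + cls
    return label


def element_path(tag):
    """List labels from <body> down to the tag itself."""
    path = [describe(tag)]
    for parent in tag.parents:
        # Stop above <body>: skip <html> and the document object itself
        if parent.name in ("html", "[document]"):
            break
        path.append(describe(parent))
    path.reverse()
    return path


# Hypothetical sample page, loosely modelled on the path in the question
html = (
    '<html><body class="mediawiki"><div id="content" class="mw-body">'
    '<ul><li><a href="/wiki/Talmud">Talmud</a></li></ul></div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=True):
    print(" > ".join(element_path(link)), link["href"])
```

Reading the name and attributes directly also avoids splitting the serialized tag on ">", which would cut a tag short if an attribute value happened to contain that character.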


Hello dear Alex, many thanks for adding the idea of doing this with lxml ;)
Hello dear Sri Lanka, many thanks for the solution you provided; it looks very interesting. I got a list of strings as output, followed by [Finished in 4.645s]. Is there anything I did wrong?
Yes, the output should look like that.