Python: How to find the distance between two elements in HTML using BeautifulSoup

The goal is to use BeautifulSoup to find the distance between two tags, for example between the first external a href attribute and the title tag.

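import re
from bs4 import BeautifulSoup

html = '<title>stackoverflow</title><a href="https://stackoverflow.com">test</a>'
soup = BeautifulSoup(html, 'html.parser')
ext_link = soup.find('a', href=re.compile("^https?:", re.IGNORECASE))
title = soup.title
# abs_distance_between_tags is the function being asked for; it is not defined here
dist = abs_distance_between_tags(ext_link, title)
print(dist)  # desired output: 30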

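How would I do this without using regular expressions?

Note that the order of the tags may differ, and there may be more than one match (although find() only takes the first one).

I could not find a method in BeautifulSoup that returns the position of a match within the HTML.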

As you noted, it seems you cannot get the exact character position of an element in BeautifulSoup.

Perhaps the following can help:

Also, lxml only provides the source line, which is not enough. Cf.:

"Original line number as found by the parser, or None if unknown."

But expat provides the exact offset in the file: CurrentByteIndex.
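To make the expat suggestion concrete, here is a minimal sketch using the standard library's xml.parsers.expat. It rests on several assumptions: the markup must be well-formed XML, the fragment is wrapped in a hypothetical `<root>` element because expat requires a single root, and the helper name `expat_offsets` is invented for illustration. CurrentByteIndex counts bytes, so it equals the character offset only for ASCII input.

```python
import xml.parsers.expat

def expat_offsets(fragment):
    """Return (tag_start, text_start) offsets for the first occurrence of each tag."""
    wrapper = "<root>"  # hypothetical wrapper; its length is subtracted again below
    tag_start = {}      # tag name -> offset of its '<'
    text_start = {}     # tag name -> offset of its first character data
    stack = []          # names of the currently open elements
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        stack.append(name)
        if name != "root":
            tag_start.setdefault(name, parser.CurrentByteIndex - len(wrapper))

    def end_element(name):
        stack.pop()

    def char_data(data):
        if stack and stack[-1] != "root":
            text_start.setdefault(stack[-1], parser.CurrentByteIndex - len(wrapper))

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(wrapper + fragment + "</root>", True)
    return tag_start, text_start

html = '<title>stackoverflow</title><a href="https://stackoverflow.com">test</a>'
tag_start, text_start = expat_offsets(html)

# expat reports no offset for attribute values, so search from the start of the <a> tag
href_start = html.index("https", tag_start["a"])
dist = abs(href_start - text_start["title"])
print(dist)  # 30
```

Real-world HTML is often not well-formed XML, in which case expat raises an ExpatError and this approach does not apply.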

When read from the start_element handler, it returns the start of the tag (i.e. the "<").

BeautifulSoup 4 now supports `tag.sourceline` and `tag.sourcepos`.
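As a rough sketch of how those two attributes could answer the question, the code below turns them into absolute character offsets. It assumes a recent BeautifulSoup 4 release with the 'html.parser' backend, where sourceline is 1-based and sourcepos is the 0-based column of the tag's "<" within that line; other parsers report these values differently or not at all. The helper `absolute_offset` is hypothetical, and `abs_distance_between_tags` takes the source string as an extra argument compared with the question.

```python
from bs4 import BeautifulSoup

def absolute_offset(tag, source):
    """Character offset of the tag's '<' in source (html.parser semantics assumed)."""
    lines_before = source.split("\n")[:tag.sourceline - 1]
    # +1 per preceding line accounts for the newline removed by split()
    return sum(len(line) + 1 for line in lines_before) + tag.sourcepos

def abs_distance_between_tags(first, second, source):
    """Absolute distance in characters between the start tags of two elements."""
    return abs(absolute_offset(first, source) - absolute_offset(second, source))

def is_http_link(href):
    # href is None for <a> tags that have no href attribute
    return href is not None and href.lower().startswith(("http:", "https:"))

html = '<title>stackoverflow</title><a href="https://stackoverflow.com">test</a>'
soup = BeautifulSoup(html, "html.parser")
ext_link = soup.find("a", href=is_http_link)
dist = abs_distance_between_tags(ext_link, soup.title, html)
print(dist)  # 28
```

This measures from one opening "<" to the other, which gives 28 here. The 30 in the question is measured from the title text to the href value, so it would need additional offsets inside each tag.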

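What kind of "distance"? The number of tags between them? That depends on exactly how you want to count them. The number of pixels on screen? That depends on which browser you are targeting. Something else?

Sorry, I meant the number of characters between them; in the example, from the position of the s in stackoverflow to the h in https.

I don't understand your aversion to regular expressions. You could use
```
ext_link = soup.find(lambda x: x.name == "a" and x.get("href", "").startswith(("http:", "https:")))
```
but that is uglier and less flexible. Parsing HTML/XML with regular expressions is usually a bad idea, but in this case it is probably the best option.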
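Another regex-free option, offered only as a sketch, is to let the standard library parse the URL scheme instead of matching string prefixes. The predicate name `is_external_href` is an assumption for illustration; urlsplit lowercases the scheme, so the check is case-insensitive like the original re.IGNORECASE pattern.

```python
from urllib.parse import urlsplit

def is_external_href(href):
    """True if href is an absolute http or https URL; False for None or other values."""
    if not href:
        return False
    return urlsplit(href).scheme in ("http", "https")

print(is_external_href("https://stackoverflow.com"))  # True
print(is_external_href("/questions"))                 # False
```

BeautifulSoup accepts a function as an attribute filter, so this could be passed as soup.find('a', href=is_external_href); tags without an href reach the function as None, which the first check handles.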