Python: highlighting the result source in a BeautifulSoup select

This is my code:

import bs4
from soupselect import select

soup = bs4.BeautifulSoup('<body><p>text</p></body>')
res = select(soup,'p')

For each result, I would like to print the offset of the element in the source text, and its length.

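Neither soupselect nor BeautifulSoup can do what you want; the soup tree does not track original source offsets, and the HTML parsers do not pass this information along when building the tree.

Moreover, the tree builders repair broken HTML; the html5lib parser will insert missing HTML elements where needed, such as the <html>, <head> and <body> elements.

You should not use the soupselect project with beautifulsoup4; it was designed for version 3. Instead, use the CSS selector support built into beautifulsoup4 (soup.select()) to select elements.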
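
If the offsets really are needed, one option outside BeautifulSoup is the standard library's html.parser, whose getpos() reports the line and column of the tag currently being handled. The sketch below is not something soupselect or BeautifulSoup provides; OffsetRecorder and element_spans are hypothetical names chosen for illustration, and it assumes reasonably well-formed HTML (unclosed tags are simply dropped, and no repair is attempted).

```python
from html.parser import HTMLParser


class OffsetRecorder(HTMLParser):
    """Collect (tag, start_offset, length) for each closed element."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Absolute offset at which each line begins (HTMLParser counts "\n" only).
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.open_tags = []  # stack of (tag, start_offset)
        self.spans = []      # finished (tag, start_offset, length)

    def _offset(self):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # Match the nearest open tag of the same name; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            name, start = self.open_tags[i]
            if name == tag:
                # The element ends just past the ">" that closes this end tag.
                end = self.source.index(">", self._offset()) + 1
                self.spans.append((tag, start, end - start))
                del self.open_tags[i:]  # also drops unclosed children
                break


def element_spans(source):
    """Parse the whole source in one go and return the recorded spans."""
    recorder = OffsetRecorder(source)
    recorder.feed(source)
    recorder.close()
    return recorder.spans


source = '<body><p>text</p></body>'
for tag, start, length in element_spans(source):
    print(tag, start, length, source[start:start + length])
```

For the input from the question this reports offset 6 and length 11 for the <p> element, measured against the original text rather than a repaired tree.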

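I understand what you are after, but as Martijn Pieters says, this is not something BeautifulSoup keeps track of.

That said, you can do it with standard Python functionality, but only under some constraints: the tag you are looking for should be unique (or you should pass a start offset to the find method of the string object). Also, keep in mind that broken HTML gets repaired as well as possible, so do not expect a good match if the original HTML was broken.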

>>> import bs4
>>> # Parser named explicitly (an assumption); the output below matches lxml
>>> soup = bs4.BeautifulSoup('<body><p>text</p></body>', 'lxml')
>>> print(repr(soup))  # note that new tags have been added!
<html><body><p>text</p></body></html>
>>> first_p = repr(soup.find('p'))  # it is now a string, no longer a tag
>>> repr(soup).find(first_p)  # offset in the repaired document, including the added tags
12
>>> repr(soup).find(first_p) - 6  # subtract len("<html>"), which was added automatically
6
>>> len(first_p)
11

This does put a lot of restrictions on the tags you are looking for, but it should give you a start.
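
When the tag is not unique, the start offset mentioned above can be used to walk through every occurrence. Here is a minimal sketch using only str.find; find_all_offsets is a hypothetical helper name, and it assumes the serialized tag matches the original source character for character (in practice the snippet would be something like str(tag) from the soup, which only matches if no repair or reformatting took place).

```python
def find_all_offsets(source, snippet):
    """Yield every offset at which snippet occurs in source (non-overlapping)."""
    if not snippet:
        raise ValueError("snippet must not be empty")
    start = 0
    while True:
        pos = source.find(snippet, start)  # second argument is the start offset
        if pos == -1:
            return
        yield pos
        start = pos + len(snippet)  # continue searching after this match


source = '<body><p>text</p><p>text</p></body>'
snippet = '<p>text</p>'  # stands in for the serialized tag
for offset in find_all_offsets(source, snippet):
    print(offset, len(snippet))
```

Searching the original source directly also avoids having to subtract the length of automatically added tags such as <html>.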

How did you arrive at the number 5 for the offset? You obviously did not include the brackets? And you are aware that BeautifulSoup will add more tags to your HTML string?

My mistake, the offset is 6.

I am looking for a way to extract the source of a regular site's HTML for a specific jQuery selection.
