Python: Finding an exact text match with BeautifulSoup

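I want to use BeautifulSoup to extract text from HTML that exactly matches a given value. However, I also get text that only nearly matches my exact text. My code is: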

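# Python 2 code (urllib2, print statement), as posted by the asker
from bs4 import BeautifulSoup
import re
import urllib2

url = "http://www.somesite.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "lxml")
# A compiled regex matches anywhere inside a string, so substrings match too
for elem in soup(text=re.compile("exact text")):
    print elem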

For the code above, the output looks like this:

1.exact text
2.almost exact text

How can I get only the exact match using BeautifulSoup?

Note: the variable (elem) should be of type [...].

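You can search the soup for the desired elements by tag and by any attribute value. For example, the code below searches for all a elements whose id equals some_id_value.

It then loops over each element found and tests whether its .text value equals "exact text".

If it does, it prints the whole element.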

for elem in soup.find_all('a', {'id':'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
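
To make the behaviour of this approach easier to check, here is a minimal self-contained sketch. The HTML snippet is made up for illustration (it is not from the asker's page), and it uses the built-in "html.parser" rather than "lxml" only to avoid an extra dependency. It also calls get_text(strip=True) instead of .text, on the assumption that surrounding whitespace should not prevent a match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page.
html = """
<div>
  <a id="some_id_value">exact text</a>
  <a id="some_id_value">almost exact text</a>
  <a id="other_id">exact text</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> elements with the wanted id whose full text,
# ignoring leading/trailing whitespace, equals the target string.
matches = [
    elem
    for elem in soup.find_all("a", {"id": "some_id_value"})
    if elem.get_text(strip=True) == "exact text"
]

for elem in matches:
    print(elem)
```

Only the first link should be kept: the second has the right id but different text, and the third has the right text but a different id.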

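Use BeautifulSoup's find_all method with its string argument for this.

As an example, here I parse a small Wikipedia page about a place in Jamaica. I look for all strings whose text is "Jamaica stubs", although I expect to find only one. When I find it, I display the text and its parent.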
>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
...     item
...     item.findParent()
... 
'Jamaica stubs'
<a href="/wiki/Category:Jamaica_stubs" title="Category:Jamaica stubs">Jamaica stubs</a>
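
The difference between the question's regex search and the plain string search above can be sketched on a small made-up document (the HTML is an assumption for illustration). A compiled regex is applied as a search, so it matches anywhere inside a string, whereas a plain str passed to string= has to equal the whole string. Anchoring the regex with ^ and $ is another way to ask for the whole string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML reproducing the two strings from the question.
html = "<p>exact text</p><p>almost exact text</p>"
soup = BeautifulSoup(html, "html.parser")

# Regex search: matches any string that CONTAINS "exact text".
substring_hits = [str(s) for s in soup.find_all(string=re.compile("exact text"))]

# Plain string: matches only strings that EQUAL "exact text".
exact_nodes = soup.find_all(string="exact text")
exact_hits = [str(s) for s in exact_nodes]

# Anchored regex: also restricted to the whole string.
anchored_hits = [str(s) for s in soup.find_all(string=re.compile(r"^exact text$"))]

print(substring_hits)
print(exact_hits)
print(anchored_hits)

# Each hit is a text node; its enclosing tag is available via .parent.
print(exact_nodes[0].parent.name)
```

Note that the plain-string comparison is literal, so a string with extra whitespace around it would not match.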

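In the Hockey example, I used IGNORECASE in the regular expression so that both "Women" and "women" are found in the Wikipedia article. I used enumerate in the for loop to number the displayed items, which makes them easier to read.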
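Thanks for your help. The above code does not work for me: soup.find_all(string="Jamaica stubs") returns nothing.
You would do better to provide an example, or some of the HTML you are trying to search.
I think I have provided an improvement in the second version.
for elem in soup.find_all(text="Tullus"): print elem works fine.
Thanks for your reply... I just want to search for where the text appears, without using any tags.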
>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
...     i, item.findParent().text[:100]
... 
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")
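
The regex above finds every string that contains "women", which is why item 1 is a long sentence. If an exact but case-insensitive match is wanted, with no tag names involved, one possible sketch is to anchor the pattern or to pass a function to string=. The HTML and the helper name is_exact_women below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML with an exact hit, a padded hit and a near miss.
html = """
<ul>
  <li>Women</li>
  <li> women </li>
  <li>Women's hockey</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchored, case-insensitive regex; \s* tolerates surrounding whitespace.
pattern = re.compile(r"^\s*women\s*$", re.IGNORECASE)
regex_hits = [s.strip() for s in soup.find_all(string=pattern)]

# Hypothetical helper: the same test written as a function over each string.
def is_exact_women(s):
    return s is not None and s.strip().lower() == "women"

function_hits = [s.strip() for s in soup.find_all(string=is_exact_women)]

print(regex_hits)
print(function_hits)
```

Both searches look only at text nodes, so no tag has to be named. The enclosing element, if needed, can be reached afterwards through each string's parent.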