Python: find and replace text with a regular expression in BeautifulSoup, but not inside <a></a>


python html


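I am parsing HTML with BeautifulSoup to find all the text that is

1. not contained inside any anchor element

I came up with this code, which finds all the links via their href, but not the other way around.

How can I modify this code so that BeautifulSoup returns only the plain text, so that I can find, replace, and modify the soup?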

for a in soup.findAll('a', href=True):
    print(a['href'])

Edit, with an example:

<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Identify me 1 

   Identify me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>

Text that should be matched:

This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.


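I am doing this to find text that is not inside <a></a>. The desired result after replacement:

This should be identified
Replace me 1
Replace me 2
This paragraph should be identified

Thanks for your time.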

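If I understand you correctly, you want to get the text inside the a elements that have an href attribute. To get the text of an element, you can use the .text attribute: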

>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'

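Text nodes in the tree, such as those returned by soup.findAll(text=True), are of type BeautifulSoup.NavigableString. To check whether the parent of a text node is an <a> element, you can test:

txt.parent.name == 'a'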
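
The parent check above only looks one level up, so text nested deeper inside a link, such as <a><b>text</b></a>, would still pass it. As a minimal sketch, assuming the newer bs4 package rather than the BeautifulSoup 3 module used in this answer, find_parent('a') walks all ancestors instead. The helper name text_outside_links and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def text_outside_links(html):
    """Return the stripped, non-empty text nodes that have no <a> ancestor.

    Hypothetical helper, written against bs4 (not BeautifulSoup 3).
    """
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for node in soup.find_all(string=True):
        # find_parent returns None when no enclosing <a> element exists.
        if node.find_parent("a") is None and node.strip():
            result.append(node.strip())
    return result


sample = '<div><a href="x"><b>linked</b></a> plain <p>para</p></div>'
print(text_outside_links(sample))
```

Here "linked" is skipped even though its direct parent is <b>, because one of its ancestors is an <a> element.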

Another edit:

Here is another example, with a regular expression and a replacement:

import BeautifulSoup
import re

soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Identify me 1 

   Identify me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)
print(soup)


Output:

<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br /></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br /></div><div><br /></div>
<div>
   this should be replacefied 

   replacefy me 1 

   replacefy me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
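
Note that the example above lowercases every text node it rewrites, which is why "This should be" became "this should be" in the output. The following is a rough sketch of the same idea, assuming the bs4 package instead of BeautifulSoup 3; it matches case-insensitively without lowercasing the rest of the text. The function name replace_outside_links, its parameters, and the shortened sample markup are illustrative assumptions, not part of the original answer.

```python
import re

from bs4 import BeautifulSoup

# Same pattern as in the answer, compiled with IGNORECASE so that
# the surrounding text does not need to be lowercased first.
PATTERN = re.compile(r"identi(\w+)", re.IGNORECASE)


def replace_outside_links(html, pattern=PATTERN, repl=r"replace\1"):
    """Apply pattern.sub to every text node that is not inside an <a> element."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        if node.find_parent("a") is not None:
            continue  # leave link text untouched
        if pattern.search(node):
            node.replace_with(pattern.sub(repl, node))
    return str(soup)


html = (
    '<div><a href="www.test1.com/identify">identify</a>'
    ' Identify me 1 <b>identified</b></div>'
)
print(replace_outside_links(html))
```

Only text nodes are visited, so attribute values such as the href containing "identify" are never rewritten, and the change is made in the parsed document itself before it is serialized again.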


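HTML does not have href tags; anchor elements (<a>) have href attributes. Your question is not clear. Can you provide before and after examples of what you are trying to achieve?

I think what he wants to get is things that might be links ("http://..." etc.) in the plain text, when they are not inside links. The probable goal is a content filter that turns textual links into real links. Or then again, rereading the second point, maybe he means finding text that does not contain textual links? More information is definitely needed.

@Chris Morgan I have added an example that matches my requirement...

I think he wants to get things that might be links ("http://..." etc.) but are not used inside a link.

@DiggyF I want to do the opposite of what you did. I have now added an example to make this clear.

@DiggyF If the word is found, how do I perform the replacement in the current HTML itself? Also, this code finds the word "identified" even when it is inside a tag. Can I use a regular expression to match the desired pattern instead of "identified"? Thanks.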