python - How to get all <p> tags before a certain text in a web page using BeautifulSoup?

python html parsing web-crawler

My web page has many <p> tags. I want all the <p> tags that appear before a certain unique text in the page. How can I do this?

<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>

So I want to get the list [p1, p2, p3], but I don't want p4 and p5.

You can pass a function to find_all that selects a 'p' tag only if none of its previous siblings contains the specific text, for example:

from bs4 import BeautifulSoup

html = '''
<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
soup = BeautifulSoup(html, 'html.parser')

def select_tags(tag, text='certain unique text'):
    # Keep a <p> only if no earlier sibling tag contains the marker text.
    return tag.name=='p' and all(text not in t.text for t in tag.find_previous_siblings())

# find_all calls select_tags once for every tag in the document.
print(soup.find_all(select_tags))

[<p>p1</p>, <p>p2</p>, <p>p3</p>]
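The filter above compares siblings only, so it assumes the <p> tags and the marker share one parent. As a minimal sketch of an alternative for pages where that may not hold, you could locate the marker text first and then walk backwards with find_all_previous. The helper name paragraphs_before and the nested <div> in the sample are assumptions added for illustration, not part of the original question.

```python
from bs4 import BeautifulSoup

# Sample markup; the <div> around p2 is an illustrative assumption
# to show that the paragraphs need not be siblings of the marker.
html = '''
<p>p1</p>
<div><p>p2</p></div>
<p>p3</p>
<span class="zls"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''

def paragraphs_before(soup, text):
    """Return every <p> that precedes the first string containing `text`."""
    # Find the first text node that contains the marker text.
    marker = soup.find(string=lambda s: s is not None and text in s)
    if marker is None:
        return []
    # find_all_previous returns nearest-first; reverse to document order.
    return marker.find_all_previous('p')[::-1]

soup = BeautifulSoup(html, 'html.parser')
result = [p.get_text() for p in paragraphs_before(soup, 'certain unique text')]
print(result)
```

Note that if the marker text sits inside a <p> itself, that enclosing <p> is also returned, because ancestors count as previous elements.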

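Apart from what sir t.m.adam has already shown, you can also do it like this to get the text from the <p> tags that appear before the element with class zls: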

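from bs4 import BeautifulSoup

html_content = '''
<t>p0</t>
<y>p00</y>
<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls" id=".B1.D9.87.D8.A7.DB.8C_.D9.88.D8.A"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(html_content, 'lxml')

for items in soup.select(".zls"):
    # Previous siblings come back nearest-first, so the list is in reverse
    # document order; the name check skips non-<p> siblings such as <t> and <y>.
    tag_items = [item.text for item in items.find_previous_siblings() if item.name=="p"]
    print(tag_items)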
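Because find_previous_siblings returns the nearest sibling first, the loop above lists the paragraphs from the closest one back to the first. If you want them in page order, one possible variation is to pass the tag name straight to find_previous_siblings and reverse the result. This is a sketch under assumptions: it uses the built-in 'html.parser' instead of lxml, and the variable names marker and texts_before are illustrative.

```python
from bs4 import BeautifulSoup

html_content = '''
<t>p0</t>
<y>p00</y>
<p>p1</p>
<p>p2</p>
<p>p3</p>
<span class="zls"> certain unique text </span>
<p>p4</p>
<p>p5</p>
'''
soup = BeautifulSoup(html_content, 'html.parser')

# select_one returns the first element with class "zls", or None.
marker = soup.select_one('.zls')
texts_before = []
if marker is not None:
    # Passing 'p' filters by tag name; reversed() restores document order.
    texts_before = [p.get_text() for p in reversed(marker.find_previous_siblings('p'))]
print(texts_before)
```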
