Python: get links from <a> tags with BeautifulSoup (located between two other tags)

Please help me solve a Python problem based on the following HTML:

<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>

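I want to collect the links from the <a> tags into a list by appending them, but I can't get it to work. I'm an amateur and have tried things like education2.append(elt2.get("href")), but with very limited success. Any ideas?

Thanks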

You can try this:

from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all('a')]
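
The snippet above collects only the link text, while the question also asks for the href values. Below is a minimal sketch of one way to build both lists in a single pass, using the .get("href") call the asker mentioned. It assumes the built-in "html.parser" rather than lxml so that nothing extra needs to be installed. The anchor without an href is a made-up addition to the sample, included only to show what .get() returns for a missing attribute.

```python
from bs4 import BeautifulSoup

html = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div>
<div><a name="top">No link here</a></div>
<h2 class="sectionTitle">Two</h2>
"""

soup = BeautifulSoup(html, "html.parser")

texts = []
links = []
for a in soup.find_all("a"):
    href = a.get("href")  # returns None instead of raising when the attribute is missing
    if href is None:
        continue  # skip anchors that are not real links
    texts.append(a.get_text(strip=True))
    links.append(href)

print(texts)
print(links)
```

Using a["href"] instead would raise a KeyError on the anchor that has no href, which is why .get() is the more forgiving choice when the page is not fully under your control.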

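Improving on @Ajax1234's answer: this will only find <a> tags that have the itemprop="affiliation" attribute.

from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
# Keep only anchors whose itemprop attribute equals "affiliation"
final_text = [i.text for i in s.find_all("a", attrs={"itemprop": "affiliation"})]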
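
The same attribute filter can also be written as a CSS selector with select(), which some readers find easier to scan. This is only a sketch: it assumes "html.parser" instead of lxml, and the "Skip me" link is a hypothetical extra added to show that anchors without the attribute are ignored.

```python
from bs4 import BeautifulSoup

html = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a href="/unrelated">Skip me</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<h2 class="sectionTitle">Two</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: <a> elements whose itemprop is exactly "affiliation"
matches = soup.select('a[itemprop="affiliation"]')

texts = [a.get_text(strip=True) for a in matches]
links = [a.get("href") for a in matches]

print(texts)
print(links)
```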

You were very close to what you wanted. I've made a few changes.

This will give you what you want:

from bs4 import BeautifulSoup

html = '''<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div>
<div>dummy</div>
<h2 class="sectionTitle">Two</h2>'''

soup = BeautifulSoup(html, 'lxml')
texts = []
links = []
for tag in soup.find('h2', text='One').find_next_siblings():
    if tag.name == 'h2':
        break
    a = tag.find('a', itemprop='affiliation', href=True, text=True)
    if a:
        texts.append(a.text)
        links.append(a['href'])

print(texts, links, sep='\n')

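I added a dummy <div> tag with no child tags to show that the code won't fail when a sibling contains no link.

If the HTML has no <a> tags with itemprop="affiliation" outside this section, you can use this directly: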

texts = [x.text for x in soup.find_all('a', itemprop='affiliation', text=True)]
links = [x['href'] for x in soup.find_all('a', itemprop='affiliation', href=True)]
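
The two one-liners above pick up every matching <a> on the page. If a later section also contains itemprop="affiliation" links, the sibling walk is still needed. Here is a sketch of that walk wrapped in a reusable function. The name links_in_section is hypothetical, the sample HTML is rearranged so that section "Two" has a link of its own, and "html.parser" is assumed in place of lxml. It uses string= (available since Beautiful Soup 4.4.0) where the answer uses text=.

```python
from bs4 import BeautifulSoup

html = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<h2 class="sectionTitle">Two</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div>
"""


def links_in_section(soup, title):
    """Return (texts, links) for affiliation anchors between the <h2> titled
    `title` and the next <h2>. Hypothetical helper, not part of bs4."""
    heading = soup.find("h2", string=title)
    if heading is None:
        return [], []
    texts, links = [], []
    for sibling in heading.find_next_siblings():
        if sibling.name == "h2":
            break  # reached the next section heading
        # find_all handles siblings that hold zero, one, or several anchors
        for a in sibling.find_all("a", itemprop="affiliation", href=True):
            texts.append(a.get_text(strip=True))
            links.append(a["href"])
    return texts, links


soup = BeautifulSoup(html, "html.parser")
texts, links = links_in_section(soup, "One")
print(texts)
print(links)
```

Calling links_in_section(soup, "Two") on the same document returns only the Text3 entry, so each section can be scraped into its own pair of lists.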

My approach to your problem is as follows:

from bs4 import BeautifulSoup
html = '''
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
'''
soup = BeautifulSoup(html, "html.parser")

# Extract the texts
result1 = [i.text.strip('\n') for i in soup.find_all('div')]
print(result1)

# Extract the HREF links
result2 = [j['href'] for j in soup.find_all('a',href=True)]
print(result2)

I hope this solves the problem.
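
One caveat with collecting text from the <div> tags: the sample HTML ends with an unclosed <div>, and with "html.parser" that <div> wraps the second heading, so its text ends up in the list as well. The sketch below compares the two approaches on the same sample. The variable names by_div and by_anchor are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
'''
soup = BeautifulSoup(html, "html.parser")

# Text of every <div>: the trailing unclosed <div> contributes the heading text
by_div = [d.get_text(strip=True) for d in soup.find_all("div")]

# Text of the affiliation anchors only: unaffected by the stray <div>
by_anchor = [a.get_text(strip=True) for a in soup.find_all("a", itemprop="affiliation")]

print(by_div)
print(by_anchor)
```

Selecting the anchors directly keeps the text list aligned with the list of href values, which matters when the two lists are later zipped together.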

Can you explain what conditions a link has to meet for you to want its text? Can you pick out the links by the itemprop="affiliation" attribute, or do they have to come after the div?

The itemprop="affiliation" attribute identifies the links I need!

Then I suggest you look through the documentation yourself; it will help you learn more. Leave a comment if you get stuck.

Thanks for your help! Unfortunately, I have lots of other tags that I don't want scraped; I probably should have included those. Also, I need the results in two separate lists.

Thanks Oisin, after you told me to, I actually went back to the documentation and it was very helpful :)

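Output of the last answer's code (note the extra 'Two', which comes from the unclosed <div> at the end of the sample HTML):

['Text1', 'Text2', 'Text3', 'Two']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']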