Python: can't find_all if you're a sibling in Beautiful Soup

I have some HTML that I'm trying to parse. It is formatted with almost no class identifiers, so I have very little to hook into. It looks a bit like this:

<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
    ...
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
    <li><a href="I look like I could be important">Cool looking info in here</a></li>
    <li><a href="I look like I could be important">Cool looking info in here</a></li>
</ul>

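The problem is that I don't know what to do next, because at that point I'm looking for whatever comes after the heading tag. The only approach I can see is to use some kind of get-children function. So I'm doing this: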

for body in section.next_siblings:

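This has two problems:

1. There should only be one sibling afterwards. I don't really understand in what situation there would be more than one.
2. I can't call `.find_all("a")` on body to get the links, because the sibling is not the same kind of object as the original HTML soup I parsed earlier.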

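How would you recommend accessing the href link and the text in the `<a>` tags when they sit directly under the `<h3>` tag I care about?

The problem here seems to be that I want the content that comes directly after the `<h3>` tag. It would be great if I could somehow split the document by what lies between these tags.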

Use `next_sibling`, which also exists without the plural form, to get the first next sibling:

res = []
sections = part.find_all('h3', 
                         string=lambda s:'I am an important section of the list' in s)
for section in sections:
    for item in section.next_sibling.next_sibling.find_all('a'):
        res.append(item.get('href'))

print(res)

>>>['commonStuff/newThing1', 'commonStuff/newThing2']
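
As a possible alternative (a sketch, not the method used in the answer above), `find_next_sibling('ul')` skips whitespace text nodes, so it should work whether or not the source has a line break after `</h3>`. The sample markup and the name `links` below are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (an assumption for illustration).
html = """
<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
    <li><a href="elsewhere/1">Cool looking info in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for section in soup.find_all("h3"):
    # Match on the heading text; get_text() also copes with nested tags.
    if "I am an important section" not in section.get_text():
        continue
    # find_next_sibling("ul") ignores text nodes and returns the next <ul>, or None.
    ul = section.find_next_sibling("ul")
    if ul is None:
        continue
    for a in ul.find_all("a"):
        links.append((a.get("href"), a.get_text(strip=True)))

print(links)
```

Each entry pairs the href with the link text, which covers both pieces of information the question asks for.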

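Explanation of `next_sibling.next_sibling`:

If the HTML source does not contain a line break after `</h3>`, a single `next_sibling` is enough. When there is a line break, BeautifulSoup interprets it as a `NavigableString`, which becomes the first sibling.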

In the first example there is a line break after `</h3>`:

from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
</ul>
 """
# Build the tree with the BeautifulSoup class (the original rebound an undefined name `soup`)
soup = BeautifulSoup(html, 'html.parser')

sections = soup.find_all('h3')
for section in sections:
    print('next sibling : ', section.next_sibling)
    print(type(section.next_sibling))

Result:

next sibling :  

<class 'bs4.element.NavigableString'>

In this example there is no line break after `</h3>`, so we directly get the tag we are looking for:

from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3><ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
</ul>
 """
# Build the tree with the BeautifulSoup class (the original rebound an undefined name `soup`)
soup = BeautifulSoup(html, 'html.parser')

sections = soup.find_all('h3')
for section in sections:
    print('next sibling : ', section.next_sibling)
    print(type(section.next_sibling))

Result:

next sibling :  <ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
</ul>
<class 'bs4.element.Tag'>
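
To tie the two cases together, here is a minimal sketch that walks `next_siblings` and keeps only `Tag` objects, so whitespace `NavigableString` nodes are skipped either way. It also stops at the next `<h3>`, which effectively splits the document by heading. The markup and the name `by_heading` are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# One heading followed by a line break, one followed directly by <ul> (illustrative markup).
html = """<h3>First</h3>
<ul>
    <li><a href="a/1">one</a></li>
</ul>
<h3>Second</h3><ul><li><a href="b/1">two</a></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_heading = {}
for h3 in soup.find_all("h3"):
    hrefs = []
    for sibling in h3.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # whitespace between tags shows up as a NavigableString
        if sibling.name == "h3":
            break  # the next section starts here
        hrefs.extend(a.get("href") for a in sibling.find_all("a"))
    by_heading[h3.get_text(strip=True)] = hrefs

print(by_heading)
```

Filtering on `Tag` is what makes `find_all` safe to call here: only tags are searched, never the text nodes in between.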

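Perfect! `item.text` gets me the content inside the brackets. How do I access the href link? Could you also explain why `find_all()` is only defined when `next_sibling` is called twice? For some reason, if you call it only once, it isn't defined. I think it has something to do with your "plural" terminology.

It has to do with the HTML source and line breaks; I added some examples.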