Python BeautifulSoup4: find all non-nested matches

python python-3.x web-scraping recursive-datastructures

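I am having trouble setting up a simple search for all the outermost elements in an HTML document that match my query. I am asking here in the hope that there is a simple bs4 function that does this, but that does not seem to be the case.

Consider the following HTML example, in which I want all the outermost <div> elements that have the "wanted" class (I expect to get a list of 2):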

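    import bs4

    # The tags of this sample were lost; the markup here is reconstructed
    # from the description (three div.wanted, one of them nested).
    text = """
    <div class="inner">
        <div class="wanted">
            I want this.
            <div class="wanted">
                I don't want that!
            </div>
        </div>
    </div>
    <div class="inner">
        <div class="wanted">
            I want this too.
        </div>
    </div>
    """
    soup = bs4.BeautifulSoup(text, 'lxml')

    # 1. One-shot attempts
    fetched = soup.findAll('div', class_='wanted')
    print(len(fetched))  # 3
    fetched = soup.findAll('div', class_='wanted', recursive=False)
    print(len(fetched))  # 0
    fetched = soup.findChildren('div', class_='wanted')
    print(len(fetched))  # 3
    fetched = soup.findChildren('div', class_='wanted', recursive=False)
    print(len(fetched))  # 0

    # 2. Fetching one by one
    fetched = []
    fetched0 = soup.find('div', class_='wanted')
    while fetched0:
        fetched.append(fetched0)
        # Resume the search after the last descendant of the current match,
        # so that anything nested inside it is skipped.
        descendants = list(fetched0.descendants)
        fetched0 = descendants[-1].findNext('div', class_='wanted')
    print(len(fetched))  # 2  Hurray!

    # 3. Destructive method: if you do not care about the parents of the elements
    fetched = []
    fetched0 = soup.find('div', class_='wanted')
    while fetched0:
        # extract() removes the match (and its nested content) from the tree.
        fetched.append(fetched0.extract())
        fetched0 = soup.find('div', class_='wanted')
    print(len(fetched))
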
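So none of the attempts in part #1 gives the expected result. What, then, is the difference between findAll and findChildren? And findNextSibling is not relevant here because of the nesting.

Now, part #2 works, but why does it take so much code? Isn't there a more elegant solution? As for part #3, I guess one has to be careful about the consequences.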
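
As a side note on part #1, here is a minimal sketch of what recursive=False seems to be doing. It limits the search to direct children of the object being searched; it does not mean "skip matches nested inside other matches". The markup and the variable names are made up for illustration, and html.parser is used so that the fragment is not wrapped in <html><body> (with lxml it is wrapped, which would explain the 0 in the question). findChildren is, as far as I know, just an older alias of find_all, so it should return the same results.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two div.wanted inside a wrapper, one at the top level.
html = """
<div class="inner">
  <div class="wanted">outer<div class="wanted">nested</div></div>
</div>
<div class="wanted">top-level</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): every matching descendant, nested ones included.
all_matches = soup.find_all("div", class_="wanted")

# recursive=False: only direct children of `soup` are examined.
direct_matches = soup.find_all("div", class_="wanted", recursive=False)

# findChildren is a legacy alias (it may emit a DeprecationWarning).
same_results = soup.findChildren("div", class_="wanted") == all_matches

counts = (len(all_matches), len(direct_matches))
print(counts)        # prints (3, 1)
print(same_results)  # prints True
```

So neither call filters on "outermost match"; the only question they answer is how deep below the search root to look.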

What would you suggest for this search? Have I really found the shortest way? Can I use some CSS select magic?

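
On the CSS question, one possibility is a :not() selector that excludes any div.wanted that has a div.wanted ancestor. This is only a sketch: it assumes that select() is backed by Soup Sieve (the default in recent bs4 releases) and that your version accepts a complex selector inside :not(), so check it on your installation. The markup and the id attributes are hypothetical and exist only to make the result easy to read.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids are only there to identify the matches.
html = """
<div class="inner">
  <div class="wanted" id="a">I want this.
    <div class="wanted" id="b">I don't want that!</div>
  </div>
</div>
<div class="inner">
  <div class="wanted" id="c">I want this too.</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep div.wanted elements that are NOT themselves a div.wanted
# nested somewhere inside another div.wanted.
outermost = soup.select("div.wanted:not(div.wanted div.wanted)")

ids = [tag["id"] for tag in outermost]
print(ids)  # prints ['a', 'c']
```

Note that the selector engine still has to look at every element, so this is shorter to write but probably not cheaper than filtering on parents.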
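Besides other arguments, you can also pass a function to find_all. Inside it, you can use find_parents() to check that the tag has no enclosing div with the same class.

Using find_parents():

    from bs4 import BeautifulSoup

    def top_most_wanted(tag):
        # Only div.wanted tags are candidates.
        if tag.name != "div" or "wanted" not in tag.get("class", []):
            return False
        # Reject the tag if any of its ancestors is itself a div.wanted.
        parents_same_class = tag.find_parents("div", class_="wanted")
        return len(parents_same_class) == 0

    soup = BeautifulSoup(text, 'html.parser')
    print(soup.find_all(top_most_wanted))

I ended up doing the following, which has the advantage of not being destructive. I have not had time to benchmark it; I hope it avoids walking through every nested element as the answer by @Bitto Bennichan does, but I am not sure of that. Either way, it does what I need:

    all_fetched = []
    fetched = soup.find('div', class_='wanted')
    while fetched is not None:
        all_fetched.append(fetched)
        try:
            # Last node inside the current match.
            last = list(fetched.descendants)[-1]
        except IndexError:
            # Empty tag: resume the search right after the tag itself.
            last = fetched
        # Continue from there, which skips anything nested in the match.
        fetched = last.findNext('div', class_='wanted')

If you do not want to match the inner divs, you could try divs = [div for div in soup.find_all('div', class_='wanted') if not div.findParent('div', class_='wanted')]. Is that what you are looking for?

Yes, that would give me what I want. However, I thought there would be a way to avoid going into the substructure altogether.

You could try a CSS selector that matches the direct children of the .inner elements, if that suits you: soup.select('.inner > .wanted')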
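
For the wish to avoid going into the substructure at all, a hand-written traversal that stops descending as soon as a tag matches is one option. The sketch below is not a bs4 built-in: find_outermost is a hypothetical helper, the markup and ids are made up, and it assumes the default parser settings, where the class attribute is a list. It is recursive, so extremely deep documents could hit Python's recursion limit.

```python
from bs4 import BeautifulSoup, Tag


def find_outermost(node, name, css_class):
    """Collect matching tags, never searching inside a tag that matched."""
    found = []
    for child in node.children:
        if not isinstance(child, Tag):
            continue  # skip text nodes, comments, and the like
        if child.name == name and css_class in child.get("class", []):
            found.append(child)  # match: do not look inside it
        else:
            found.extend(find_outermost(child, name, css_class))
    return found


# Hypothetical markup; the ids are only there to identify the matches.
html = """
<div class="inner">
  <div class="wanted" id="a">I want this.
    <div class="wanted" id="b">I don't want that!</div>
  </div>
</div>
<div class="inner">
  <div class="wanted" id="c">I want this too.</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

outermost = find_outermost(soup, "div", "wanted")
ids = [tag["id"] for tag in outermost]
print(ids)  # prints ['a', 'c']
```

Unlike the find_parents() filter, this never visits the nested div at all, which is the behaviour asked for in the comments; whether it is actually faster would need measuring.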