Parsing XML with regular expressions in Python

I want to parse some markup.

The pattern is:

<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>

I thought this would work:

re.findall(">"."</a></div>")

But it doesn't.

What is wrong?

------ Update 1 ------- Now I know that re is not suited to HTML.

Raj gave me an answer:

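>>> from bs4 import BeautifulSoup
>>> s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
>>> soup.select('div > a:first-of-type')[0].text  # ':first' is a jQuery extension, not standard CSS
'What_I_Want'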

I have one more question. How can I find

<div id blah blah </div>

in the whole file?

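Short answer: you can't.

A different short answer: it even has some examples.

It seems you are trying to get the text of the <a> tag that is a direct child of the parent <div> tag.

What are you trying to achieve?

Sigh. Don't try to parse HTML with regular expressions.

Daniel Roseman: then how do I search for that word?

Man, really, thank you. I have one more question: how do I search the whole HTML?

@EfirlusKim Update your question.

@EfirlusKim What if it is a malformed HTML file? Where is the closing > of the opening div tag?

@EfirlusKim Accept the answer to this question, then ask a new question that gives the exact input and the output you expect.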
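To make the objection in the comments concrete, here is a minimal sketch of what goes wrong with the regex route. The attempt in the question is not valid Python (">"."</a></div>" is a syntax error), and re.findall needs both a pattern and the string to search. The pattern and the whitespace variant below are illustrative assumptions, not part of the original thread.

```python
import re

html = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'

# A pattern that happens to work for this one exact snippet:
# capture the text between a '>' and the literal '</a></div>'.
pattern = r'>([^<>]*)</a></div>'
print(re.findall(pattern, html))  # ['What_I_Want']

# The same markup with a harmless line break before </div>
# (an assumed variation) no longer matches at all.
variant = html.replace('</a></div>', '</a>\n</div>')
print(re.findall(pattern, variant))  # []
```

A browser treats both strings as the same document, but the regex does not, which is why a parser is the safer tool here.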

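>>> from bs4 import BeautifulSoup
>>> s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
>>> # ':first' is a jQuery extension, not standard CSS; ':first-of-type' is the standard form
>>> soup.select('div > a:first-of-type')[0].text
'What_I_Want'
>>> soup.select('div > a')[0].text
'What_I_Want'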
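For the follow-up about finding the div in a whole file, the same approach can be sketched as below. The surrounding page is a hypothetical stand-in for the real file, and only the div with id="tags" comes from the question. For an actual file, the file's contents or an open file object can be passed to BeautifulSoup in place of the string.

```python
from bs4 import BeautifulSoup

# Hypothetical page; only the <div id="tags"> line is from the question.
page = """
<html><body>
  <div id="header">not this one</div>
  <div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>
  <div id="footer"><a href="http://url/other">nor this</a></div>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# find() searches the whole tree and returns the first match (or None).
tags_div = soup.find("div", id="tags")
print(tags_div.a.get_text())  # What_I_Want

# The same lookup as a CSS selector: <a> tags directly under div#tags.
links = [a.get_text() for a in soup.select("div#tags > a")]
print(links)  # ['What_I_Want']
```

If the div might be missing, check that find() did not return None before reading tags_div.a.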