
Python: match everything between HTML tags

python regex python-2.7 web-scraping


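I need to match everything between HTML tags or, if there is another way, extract all of the information inside the tags.

Here is a sample of the data: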

<B>stuff here</B>

<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT>  vehicle industries have resulted in increased competition and  
have had a material adverse effect on our business, financial condition, and 
operations.  </B>


medallions. </P> <P STYLE="margin-top:12pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman"><B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B></P>

These are the matches I need to get from this small block:

<B>stuff here</B>

<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT>  vehicle industries have resulted in increased competition and  
have had a material adverse effect on our business, financial condition, and 
operations.  </B>

<B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B>

Below are a couple of regular expressions I have tried; neither of them does what I want:

re.compile("<[Bb]>[\!\@\#\$\%\^\&\*\(\)\_\+\-\=\,\.\/\<\?\:\"\;\'\{\}\[\]\|\\\w\d\s]*<\/[Bb]>", re.MULTILINE)
re.compile("<[Bb]>.+<\/[Bb]>", re.MULTILINE)

You can use the following pattern to match everything between the tags:
 (?s)(?<=<B>).*?(?=<\/B>)
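
To make the difference from the attempts in the question concrete, here is a minimal sketch, assuming Python 3 and a shortened copy of the sample data. The names `html`, `attempt`, and `bold_pattern` are illustrative and not from the original post. `re.MULTILINE` only changes how `^` and `$` behave, so it does not let `.` cross line breaks; `re.DOTALL` (the `(?s)` flag) does. The quantifier here is lazy (`.*?`), because a greedy `.*` would run from the first `<B>` to the last `</B>` in the text.

```python
import re

# Shortened sample adapted from the question; the second <B> block spans two lines.
html = """<B>stuff here</B>

<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT>  vehicle industries have resulted in increased competition.  </B>

medallions. </P> <P STYLE="margin-top:12pt"><B>We borrow money. </B></P>
"""

# The question's second attempt: without DOTALL, "." stops at every newline,
# so the multi-line <B> block is never matched.
attempt = re.compile(r"<[Bb]>.+</[Bb]>", re.MULTILINE)
attempt_matches = attempt.findall(html)

# Lookaround pattern with DOTALL via (?s); the lazy ".*?" stops at the nearest
# closing tag. IGNORECASE (an addition for this sketch) also accepts <b>...</b>.
bold_pattern = re.compile(r"(?s)(?<=<B>).*?(?=</B>)", re.IGNORECASE)
bold_texts = bold_pattern.findall(html)

for text in bold_texts:
    print(repr(text))
```

The matches keep any markup nested inside the bold element (such as the `<FONT>` tag), since the pattern only looks at the surrounding `<B>` and `</B>`.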

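Maybe this is what you need:

In fact, because HTML allows tags to be nested inside tags (for example, a div inside a div), it needs at least a context-free grammar (Chomsky type 2), so it cannot be parsed in general by regular expressions (Chomsky type 3).

@Jacquesdhooge I have looked into that, but I am parsing unstructured HTM (not HTML). The more advanced regex I am using only extracts the bold text from the site, because bold is the only identifier on the entire page. I also load the HTML content into a text file to strip the indentation.

HTM and HTML

@HFBrowning Good to know.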
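
For readers who want the "another way" mentioned in the question, the following is a rough sketch of a parser-based alternative using only the standard library. It assumes Python 3 (in Python 2.7 the module is named `HTMLParser` rather than `html.parser`), and the class name `BoldTextCollector` is a hypothetical helper, not something from the thread. It tracks how many `<b>` elements are open, so nested markup does not confuse it.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collects the text inside each outermost <b> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <b> elements currently open
        self.chunks = []    # text pieces of the <b> element being read
        self.results = []   # finished text of each outermost <b> element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <B> and <b> both arrive as "b".
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        # Keep text only while inside a <b> element; other tags are skipped.
        if self.depth > 0:
            self.chunks.append(data)


collector = BoldTextCollector()
collector.feed(
    '<P><B>stuff here</B></P>\n'
    '<B>for-\nhire <FONT STYLE="white-space:nowrap">vehicle</FONT> industries</B>'
)
collector.close()
print(collector.results)
```

Unlike the regex, this returns only the text of each bold element and drops inner tags such as `<FONT>`; whether that is preferable depends on what the extracted text is used for.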