Python: how to get the plain text between two regex patterns

python regex

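I want to extract the text content of the div with id="UserInputtedText".

Note that the characters < > & " appear as their HTML equivalents (&lt; &gt; &amp; &quot;).

Here is the code I wrote: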


import re
response='[<td width="100%" class="wrapText device-width" valign="top" style="overflow: hidden; border-collapse: collapse !important; border-spacing: 0 !important; border: none; display: inline-block; max-width:600px;"><h3 style="font-family: Helvetica, Arial, sans-serif; font-weight: normal; line-height: 19px; color: #231f20; text-align: left; font-size: 14px; margin: 0 0 2px; font-weight:none;" align="left"><div id="UserInputtedText">Hi Dear ,<br /><br />we hope you enjoy your shopping with us !<br />please leave us a positive feedback on the feedback section on your purchase history<br />You can click the button next to the item and leave a feedback there we will REALLY appreciate that !<br />Have a Great Day &amp;amp; STAY SAFE !</div></h3>]'

# pattern 1 is meant to match: div id="UserInputtedText">
pattern1 = r'(\w+\s\w+[=][&]\w+[;]\w+[&]\w+[;][&]\w+[;])'

# pattern 2 is meant to match: </div></h3>
pattern2 = r'([&]\w+[;][/]\w+[&]\w+[;][&]\w+[;][/]\w+[&]\w+[;])'

# pattern 1, then the text to capture, then pattern 2
pattern = re.search(r'(\w+\s\w+[=][&]\w+[;]\w+[&]\w+[;][&]\w+[;])(.*)([&]\w+[;][/]\w+[&]\w+[;][&]\w+[;][/]\w+[&]\w+[;])', response)

print(pattern.group(2))


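There are two ways to go about this:

You are trying to parse HTML, which is beyond what regular expressions can handle in general. Use one of the HTML parsing libraries instead, such as BeautifulSoup.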

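You don't care about HTML in general and you know the input always has this form, perhaps because it is generated from a template. In that case you can use something like

r'div id="UserInputtedText">(.*)</div></h3>'

import html, re
m = re.search(r'div id="UserInputtedText">(.*)</div></h3>', response)
if m is None:
    ...  # handle the situation where the pattern was not found
else:
    # turn <br /> tags into newlines, then decode HTML entities
    text = html.unescape(m.group(1).replace('<br />', '\n'))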
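For illustration, here is a minimal self-contained sketch of the regex route. The `sample` string is a shortened, hypothetical stand-in for the `response` in the question, and `extract_user_text` is a helper name invented for this example. The non-greedy `(.*?)` and the `re.DOTALL` flag are optional hardening that the answer itself does not use.

```python
import html
import re

# Shortened, hypothetical stand-in for the `response` string in the question.
sample = (
    '<h3 align="left"><div id="UserInputtedText">Hi Dear ,<br /><br />'
    'we hope you enjoy your shopping with us !<br />'
    'Have a Great Day &amp;amp; STAY SAFE !</div></h3>'
)

def extract_user_text(markup):
    """Return the text inside the UserInputtedText div, or None if it is absent."""
    # Non-greedy (.*?) stops at the first closing pair; DOTALL lets '.' span newlines.
    m = re.search(r'div id="UserInputtedText">(.*?)</div></h3>', markup, re.DOTALL)
    if m is None:
        return None
    # Turn <br /> tags into newlines, then decode HTML entities once.
    return html.unescape(m.group(1).replace('<br />', '\n'))

print(extract_user_text(sample))
```

Because the ampersand in the sample is escaped twice (`&amp;amp;`), a single `html.unescape` call leaves `&amp;` in the result; a second call would be needed to get a bare `&`.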

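In principle, using an HTML parser is the better solution. In practice, when you are web scraping there is no guarantee that the element with id="UserInputtedText" will keep the same id (unless you have some agreement with the other party), and at that point most of the advantage disappears.

If you are going to process a lot of web pages, BeautifulSoup is still an advantage, because it makes it easier to avoid accidentally matching something other than what you want. But if it is just one web page, the simpler, if more brittle, regex is perfectly fine.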
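As a rough sketch of the parser route, the example below uses the standard-library `html.parser` module rather than BeautifulSoup, only so that it runs without third-party packages; with BeautifulSoup the lookup would typically go through `soup.find(id="UserInputtedText")`. The class `DivTextExtractor`, the function `extract_div_text` and the shortened `sample` string are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser

class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)  # entities in text are decoded once
        self.target_id = target_id
        self.depth = 0        # > 0 while inside the target div
        self.found = False    # True once the target div has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:                      # already inside the target div
            if tag == 'div':
                self.depth += 1             # nested div
            elif tag == 'br':
                self.parts.append('\n')     # treat <br /> as a line break
        elif not self.found and tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def extract_div_text(markup, target_id='UserInputtedText'):
    """Return the text of the div with the given id, or None if there is none."""
    parser = DivTextExtractor(target_id)
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts) if parser.found else None

# Shortened, hypothetical stand-in for the `response` string in the question.
sample = (
    '[<td><h3 align="left"><div id="UserInputtedText">Hi Dear ,<br /><br />'
    'we hope you enjoy your shopping with us !<br />'
    'Have a Great Day &amp;amp; STAY SAFE !</div></h3>]'
)

print(extract_div_text(sample))
```

Unlike the regex, this does not depend on `</div></h3>` following the text or on the exact attribute order inside the `<div>` tag, which is the kind of accidental mismatch the parser approach is meant to avoid.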

Use something like BeautifulSoup rather than regular expressions.
