Python: Can't get some content using regex


python regex python-3.x web-scraping


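I have written some code in Python, in combination with BeautifulSoup, to get some addresses that sit between br tags. If I only use BeautifulSoup to parse the required text, I can do the job with .next_sibling, as shown in the commented-out lines below. My intention is to combine BeautifulSoup and re to extract the content between the br tags.

This is my attempt so far: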
import re
from bs4 import BeautifulSoup

content = """
<div class="store">
<b>address</b><br>BLOCK ANG MO KIO AVE<br>
<b>address_one</b><br>BLOCK 407 ANG MO KIO AVE 10 #01-741<br>
<b>address_two</b><br>NO. 53 ANG MO KIO AVE 3 AMK HUB#B1-82<br>
</div>
"""
# soup = BeautifulSoup(content,"lxml")
# for addr in soup.find_all("b"):
#     print(addr.next_sibling.next_sibling)

soup = BeautifulSoup(content,"lxml")
for addr in soup.find_all(text=re.compile(r"<br>(.*?)</br>")):
    print(addr)  #It prints nothing, no error either

If you want to use a regular expression, you can try the following:
for addr in re.findall(r"<br>(.*?)<br>", content):
    print(addr)
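
As a rough, self-contained check of why the pattern from the question finds nothing even on the raw string, the sketch below runs both patterns against the sample markup. The variable names `with_closing_tag` and `between_br_tags` are made up for illustration, and the stray quote in the `class` attribute has been dropped from the sample.

```python
import re

# Sample markup from the question (stray quote in the class attribute removed).
content = """
<div class="store">
<b>address</b><br>BLOCK ANG MO KIO AVE<br>
<b>address_one</b><br>BLOCK 407 ANG MO KIO AVE 10 #01-741<br>
<b>address_two</b><br>NO. 53 ANG MO KIO AVE 3 AMK HUB#B1-82<br>
</div>
"""

# Pattern from the question: it expects a closing </br>,
# which never appears in the markup, so nothing can match.
with_closing_tag = re.findall(r"<br>(.*?)</br>", content)

# Pattern from the answer: capture the text between two consecutive <br> tags.
between_br_tags = re.findall(r"<br>(.*?)<br>", content)

print(with_closing_tag)
for addr in between_br_tags:
    print(addr)
```

This only works because each address sits on one line between two literal `<br>` strings. Markup that writes the tag as `<br/>` or `<br />` would need a more tolerant pattern.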

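As you can see, text only refers to the inner text of a tag and does not "see" the br tags, so using text here is incorrect. Besides, it only returns whole nodes whose inner text contains a match; it does not extract any substring from them. The following will extract those nodes, because they match in full: soup.find_all(text=re.compile(r"^[A-Z0-9#.-]+(?:\s+[-.#A-Z0-9]+)*$"))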
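
The comment above can be tried out with a minimal sketch like the one below. It assumes `bs4` is installed, uses the built-in `"html.parser"` instead of `"lxml"` only to avoid an extra dependency, and passes `string=`, the newer name for the `text=` argument. The names `tag_hits`, `address_pattern` and `addresses` are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question (stray quote in the class attribute removed).
content = """
<div class="store">
<b>address</b><br>BLOCK ANG MO KIO AVE<br>
<b>address_one</b><br>BLOCK 407 ANG MO KIO AVE 10 #01-741<br>
<b>address_two</b><br>NO. 53 ANG MO KIO AVE 3 AMK HUB#B1-82<br>
</div>
"""

# Assumption: "html.parser" stands in for the "lxml" parser used in the question.
soup = BeautifulSoup(content, "html.parser")

# The regex is tested against text nodes only, and text nodes never
# contain markup, so a pattern that mentions <br> cannot match anything.
tag_hits = soup.find_all(string=re.compile(r"<br>"))

# A pattern describing the whole text node selects the address nodes;
# the lowercase labels inside <b> do not match it.
address_pattern = re.compile(r"^[A-Z0-9#.-]+(?:\s+[-.#A-Z0-9]+)*$")
addresses = [str(node) for node in soup.find_all(string=address_pattern)]

print(len(tag_hits))
for addr in addresses:
    print(addr)
```

Note that this selects whole text nodes by their shape. It would also pick up any other all-uppercase text in a larger page, so the character classes are tied to this particular sample.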
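br has no closing tag. It is just a line break, so it should not have one.
Provided as an answer.
The approach you suggested works great, @Wiktor Stribiżew.
@Wiktor Stribiżew, I can't understand the * you used just now, since there is already a + inside the non-capturing group.
The quantifier applies to the whole sequence of patterns inside the group. Without it, the pattern would only match strings like ABC DEF. With *, it matches strings like ABC, ABC DEF, and A-C B.F H#D.
You must have meant to use the lazy version, r"<br>(.*?)<br>".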
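
The effect of the trailing * on the non-capturing group, discussed in the comments above, can be checked with a small sketch like this one. It uses `re.fullmatch` in place of the `^` and `$` anchors, and the names `with_star`, `without_star` and `results` are illustrative.

```python
import re

# Same token pattern as in the comment, with and without the trailing *.
with_star = re.compile(r"[A-Z0-9#.-]+(?:\s+[-.#A-Z0-9]+)*")
without_star = re.compile(r"[A-Z0-9#.-]+(?:\s+[-.#A-Z0-9]+)")

samples = ["ABC", "ABC DEF", "A-C B.F H#D"]

# For each sample, record whether each variant matches the whole string.
results = {
    s: (bool(with_star.fullmatch(s)), bool(without_star.fullmatch(s)))
    for s in samples
}

for s, (star, no_star) in results.items():
    print(f"{s!r}: with *={star}, without *={no_star}")
```

The + inside the group repeats characters within one token, while the * outside repeats the whole "whitespace plus token" unit. Without the *, exactly two tokens are required.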
Output:
BLOCK ANG MO KIO AVE
BLOCK 407 ANG MO KIO AVE 10 #01-741
NO. 53 ANG MO KIO AVE 3 AMK HUB#B1-82