Python: extracting the non-digit characters between HTML tags

python regex

I have been trying to extract the words that come before the digits in the following, without success:

<div class="text">hello there 234 44</div>

Here is what I am doing:

regex_name = re.compile(r'<div class="text">([^\d].+)</div>')

You probably need a lookbehind assertion:

(?<=">)[^\d]+
^^^^^^^
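
To make the difference concrete, here is a minimal sketch that runs both patterns against the sample input. The variable names (`html`, `original`, `lookbehind`) are only illustrative. The pattern from the question fails because `[^\d]` consumes a single non-digit and the greedy `.+` then swallows everything up to `</div>`, digits included.

```python
import re

html = '<div class="text">hello there 234 44</div>'

# Pattern from the question: one non-digit, then greedy `.+` up to </div>.
original = re.compile(r'<div class="text">([^\d].+)</div>')
captured = original.search(html).group(1)
print(repr(captured))  # 'hello there 234 44'

# Suggested pattern: the lookbehind requires `">` just before the match,
# and `[^\d]+` stops at the first digit.
lookbehind = re.compile(r'(?<=">)[^\d]+')
match = lookbehind.search(html)
words = match.group(0) if match else ""
print(repr(words))  # 'hello there '
```

Note that `[^\d]+` would run past the closing tag if the element contained no digits at all, so this is only a sketch for input shaped like the sample.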

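As a starting point, I would use an HTML parser such as BeautifulSoup to find the desired element in the HTML input and extract the element's text.
Then I would use itertools.takewhile to collect the characters of the string until a digit is met: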
In [1]: from itertools import takewhile

In [2]: from bs4 import BeautifulSoup

In [3]: data = """<div class="text">hello there 234 44</div>"""

In [4]: soup = BeautifulSoup(data, "html.parser")

In [5]: text = soup.find("div", class_="text").get_text()

In [6]: ''.join(takewhile(lambda x: not x.isdigit(), text))
Out[6]: u'hello there '
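
If BeautifulSoup is not available, the same two steps (pull the element's text, then take characters until the first digit) can be roughly sketched with the standard library's `html.parser`. This is a swapped-in alternative, not the approach shown above: `DivTextCollector` and `leading_non_digits` are hypothetical names, and unlike `soup.find` the collector gathers text from every `div` whose class list contains `text`, not just the first one.

```python
from html.parser import HTMLParser
from itertools import takewhile


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collects text inside <div class="text"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a matching div (tracks nested divs)
        self.chunks = []  # text fragments seen inside matching divs

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a matching one
        elif "text" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a matching div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def leading_non_digits(html):
    """Return the text of the matching div(s) up to the first digit."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return "".join(takewhile(lambda ch: not ch.isdigit(), text))


result = leading_non_digits('<div class="text">hello there 234 44</div>')
print(repr(result))  # 'hello there '
```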

Are you scraping a site? If so, there are tools that can find the information without regex.

This looks risky. Try one of those tools as a starting point?

Let's say I want to learn how to do this in regex :)

Let's say that is not possible in the general case. HTML is not regular.

Without itertools: from bs4 import BeautifulSoup; data = 'hello there 234 44'; ''.join([i for i in BeautifulSoup(data, 'html').get_text() if not i.isdigit()])

Awesome!! Now I get it :)
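
One thing worth noting about the comprehension-based variant from the comments: filtering with `isdigit()` removes every digit anywhere in the string, whereas `takewhile` stops at the first digit. For the sample input the results differ only in trailing spaces, but they diverge as soon as text follows the numbers. A small sketch, using an assumed sample string that is not from the original question:

```python
from itertools import takewhile

# Assumed sample: same as the question's text, plus a word after the digits.
text = "hello there 234 44 world"

# Stops at the first digit and discards everything after it.
prefix = "".join(takewhile(lambda ch: not ch.isdigit(), text))

# Drops digits only; text after the numbers is kept.
no_digits = "".join(ch for ch in text if not ch.isdigit())

print(repr(prefix))     # 'hello there '
print(repr(no_digits))  # 'hello there   world'
```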