Extracting text between tags with a regular expression in Python

I have the following text:

<p>FIFA is a non-profit organization which describes itself as an international governing body of association football, fútsal and beach soccer. It is the highest governing body of football.</p>\\n\\n<p><strong>Description:</strong><br />\\nFIFA was founded in 1904[3] to oversee international competition among the national associations of Belgium, Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its membership now comprises 211 national associations. These national associations must each also be members of one of the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America and the Caribbean, Oceania, and South America.</p>\\n\\n<p><strong>Motto</strong><strong> </strong><br />\\n For the Game. For the World.
</p>

I need to extract the text of the first paragraph.

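I tried using *, but the match runs all the way to the last closing tag.

By default, the * operator is greedy. What you want is to make it non-greedy, which you do by writing *? instead. With that change it should work.

Also, since it looks like you are trying to parse an HTML file, consider using an HTML parser.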
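The difference between the greedy and non-greedy quantifier can be seen in a small sketch. The sample string and the variable names below are made up for illustration, and re.DOTALL is added on the assumption that the paragraphs may be separated by newlines:

```python
import re

# Hypothetical two-paragraph sample, not the text from the question.
html = "<p>first</p>\n\n<p>second</p>"

# Greedy: .* grabs as much as possible, so the match ends at the LAST </p>.
greedy = re.search(r"<p>(.*)</p>", html, re.DOTALL).group(1)

# Non-greedy: .*? grabs as little as possible, so the match ends at the FIRST </p>.
lazy = re.search(r"<p>(.*?)</p>", html, re.DOTALL).group(1)

print(repr(greedy))  # 'first</p>\n\n<p>second'
print(repr(lazy))    # 'first'
```

Only the quantifier changes between the two patterns; everything else is identical.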

Try this, using only plain regular expressions:

>>> import re
>>> t = '<p>your string</p>'
>>> re.findall(r'>(.+?)<', t)
['your string']


>>> import re
>>> t = '<b>using b tag</b>'
>>> re.findall(r'>(.+?)<', t)
['using b tag']


>>> import re
>>> t = '<p>1st p</p><p>2nd p</p>'
>>> r = re.compile(r'<p>(.+?)</p>')
>>> r.findall(t)[0]
'1st p'

Hope this helps you solve the problem.
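One caveat with these patterns: by default . does not match a newline, so a paragraph that spans several lines is silently skipped. A minimal sketch of the effect, using a made-up string with a real newline inside the first paragraph (the variable names are illustrative only):

```python
import re

# Hypothetical input: the first paragraph contains a newline character.
t = "<p>line one\nline two</p><p>2nd p</p>"

# Without re.DOTALL, '.' stops at the newline, so the first paragraph is skipped.
without_dotall = re.findall(r"<p>(.+?)</p>", t)

# With re.DOTALL, '.' also matches '\n', so both paragraphs are found in order.
with_dotall = re.findall(r"<p>(.+?)</p>", t, re.DOTALL)

print(without_dotall)  # ['2nd p']
print(with_dotall)     # ['line one\nline two', '2nd p']
```

In the first case, indexing the result with [0] would return the second paragraph rather than the first, so passing re.DOTALL is probably the safer choice for multi-line input.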

You can use BeautifulSoup to parse the HTML:

import bs4
string = "<p>FIFA is a non-profit organization which describes itself as an international governing body of association football, fútsal and beach soccer. It is the highest governing body of football.</p>\\n\\n<p><strong>Description:</strong><br />\\nFIFA was founded in 1904[3] to oversee international competition among the national associations of Belgium, Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its membership now comprises 211 national associations. These national associations must each also be members of one of the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America and the Caribbean, Oceania, and South America.</p>\\n\\n<p><strong>Motto</strong><strong> </strong><br />\\n For the Game. For the World.</p>"

soup = bs4.BeautifulSoup(string, "html.parser")  # name the parser explicitly to avoid a warning
soup.find("p").text  # Get the text inside the first p tag
# 'FIFA is a non-profit organization which describes itself as an international governing body of association football, fútsal and beach soccer. It is the highest governing body of football.'
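If installing bs4 is not an option, a similar result can be sketched with the standard library's html.parser module. This is a different technique from BeautifulSoup, not its API; the class FirstParagraph and the helper first_paragraph_text are hypothetical names introduced for this example, and nested <p> elements are assumed not to occur:

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are inside the first <p>
        self.done = False     # True once the first </p> has been seen
        self.parts = []       # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_paragraph_text(html):
    """Return the text of the first complete <p>...</p>, or None if there is none."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None


print(first_paragraph_text("<p>Hello <strong>world</strong>!</p><p>second</p>"))  # Hello world!
print(first_paragraph_text("<b>no paragraph</b>"))  # None
```

Like soup.find("p").text, this keeps the text of nested tags such as <strong> and drops the tags themselves.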

A non-regex solution for the first paragraph:

s = "<p>FIFA is a non-profit organization which describes itself as an international governing body of association football, fútsal and beach soccer. It is the highest governing body of football.</p>\\n\\n<p><strong>Description:</strong><br />\\nFIFA was founded in 1904[3] to oversee international competition among the national associations of Belgium, Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its membership now comprises 211 national associations. These national associations must each also be members of one of the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America and the Caribbean, Oceania, and South America.</p>\\n\\n<p><strong>Motto</strong><strong> </strong><br />\\n For the Game. For the World.</p>"

print(s.split("<p>")[1].split("</p>")[0])

The same thing with a regular expression:

import re
print(re.split("<p>|</p>", s)[1])

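Comments:

If the text starts with some other tag instead, this will not work. The OP wants to extract the text of the first paragraph.

Hi @Poojan, I tried it with a different tag. It works.

In the second case it should return null. The OP needs the text between the first tags.

It's not working.

I tested it. Please add your code to your question.

*?

Check this, and also make sure to select python as the language.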
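The behavior discussed in these comments can be sketched as follows. The generic pattern >(.+?)< returns the text of whichever tag comes first, while a pattern tied to <p> returns the first paragraph, or None when there is no paragraph at all. The helper name first_paragraph and the sample strings are assumptions made for this illustration:

```python
import re


def first_paragraph(html):
    """Text between the first <p> and the next </p>, or None if there is no such pair."""
    match = re.search(r"<p>(.*?)</p>", html, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical input that starts with a tag other than <p>.
starts_with_b = "<b>bold</b><p>1st p</p>"

generic = re.findall(r">(.+?)<", starts_with_b)[0]

print(generic)                                # bold
print(first_paragraph(starts_with_b))         # 1st p
print(first_paragraph("<b>using b tag</b>"))  # None
```

Using re.search instead of findall(...)[0] also avoids an IndexError when nothing matches.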