Getting multiple repeated lines with a regular expression in Python


python regex


I'm very new to regex and have a very large text file, a small part of which looks like this:

<div class="hbk-preamble " id="preamble-APG5180">
<div class="hbk-preamble-entry">
<div class="hbk-preamble-icon hbk-preamble-icon_mode"></div>
<p class="hbk-preamble-heading">Offered</p>
<p><a href="index-bylocation-city-melbourne.html">City (Melbourne)</a></p><ul class="hbk-preamble-list__offerings"><li>Summer semester A 2019 (Flexible)</li></ul><p><a href="index-bylocation-clayton.html">Clayton</a></p><ul class="hbk-preamble-list__offerings"><li>First semester 2019 (On-campus)</li></ul>
</div>
</div>
<div class="notes">
<p class="hbk-heading hdg_6">Notes</p>
<p></p><ul>
<li>The unit may be offered as part of the <a class="hbk-screen-url" href="http://www.monash.edu/students/courses/arts/summer-program.html">Summer Arts Program</a><span class="hbk-print-url">Summer Arts Program (<a href="http://www.monash.edu/students/courses/arts/summer-program.html">http://www.monash.edu/students/courses/arts/summer-program.html</a>)</span>.</li>
<li>For more information please visit the <a class="hbk-screen-url" href="https://www.anzsog.edu.au/">ANZSOG webpage</a><span class="hbk-print-url">ANZSOG webpage (<a href="https://www.anzsog.edu.au/">https://www.anzsog.edu.au/</a>)</span>.</li>
</ul>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society. In a liberal democracy like Australia, its role and function have profound implications for the conduct of politics, the nature of democracy and public policy outcomes. In this unit, the relationship between the media, politics and public policy is studied from three broad perspectives. First, the politics of the media is investigated from the perspective of liberal democratic theory in order to understand the role of news media on the policy debate. Second, the political economy of the media is investigated. Particular emphasis is on the structure and operation of media organisations and journalists and how political news is covered. Third, the unit undertakes a study of the relationship between the media and political actors. Particular emphasis is on the use of public relations and 'spin doctors' in managing the media as well as the utilisation of political advertising and strategic political communication by governments and political agents.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
<ol princestart="0" start="1" type="1">

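I need to output the synopsis text for each section in the text file. How should I go about it?

So far I have read the text file using read and readlines, but I can't come up with a pattern to start from.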

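First, I won't answer your question directly. I think the way the question is framed is itself the problem. In your example you are dealing with HTML, so you have plenty of powerful tools available.

Have a look at BeautifulSoup for Python: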

from bs4 import BeautifulSoup
# content is the HTML text already read from your file, e.g. with f.read()
soup = BeautifulSoup(content, 'html.parser')

From this soup, you can extract whatever you need.

Now, coming back to your question: if you still want to use regex, this can help you:

Demo:
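A regex-only approach could look roughly like the sketch below. It is not the pattern from this answer. It assumes every section marks its synopsis exactly as in the question's sample: an `<h2 class="hbk-heading">Synopsis</h2>` heading followed by a single `<div>` with no nested `<div>` inside. The names `SYNOPSIS_RE`, `TAG_RE` and `extract_synopses` are made up for illustration.

```python
import re

# Assumed layout: the Synopsis heading, then the first <div>...</div> after it.
# re.DOTALL lets "." match newlines, so the body may span several lines.
SYNOPSIS_RE = re.compile(
    r'<h2 class="hbk-heading">Synopsis</h2>\s*<div>(.*?)</div>',
    re.DOTALL,
)
TAG_RE = re.compile(r'<[^>]+>')


def extract_synopses(html_text):
    """Return the synopsis text of every section found in html_text."""
    synopses = []
    for match in SYNOPSIS_RE.finditer(html_text):
        # Replace inline tags such as <p> with spaces, then collapse whitespace.
        text = TAG_RE.sub(' ', match.group(1))
        synopses.append(' '.join(text.split()))
    return synopses


sample = """<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>"""

print(extract_synopses(sample))
```

For the real file, read it in one piece with `read()` rather than `readlines()`, so that a match can span line breaks. The non-greedy `(.*?)` stops at the first `</div>`, which is why a nested `<div>` inside a synopsis would cut the result short.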

I'd suggest using the beautifulsoup package for this. You could try the following:
import requests
from bs4 import BeautifulSoup
data = requests.get('put website address here')
soup = BeautifulSoup(data.text, 'html.parser')
for i in soup.find_all('h2', {'class':'hbk-heading'}):
    print(i.text.strip())
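The loop above prints the heading text only. To get the synopsis body, and to work from text already read from a local file rather than a URL, something along these lines might do. It is a sketch, assuming the synopsis is always the first `<div>` sibling after an `<h2 class="hbk-heading">` whose text is `Synopsis`, as in the question's sample. The function name `synopses_from_html` is hypothetical.

```python
from bs4 import BeautifulSoup


def synopses_from_html(html_text):
    """Return the text that follows each 'Synopsis' heading in html_text."""
    soup = BeautifulSoup(html_text, 'html.parser')
    results = []
    for heading in soup.find_all('h2', class_='hbk-heading'):
        if heading.get_text(strip=True) != 'Synopsis':
            continue
        # Assumption: the synopsis body is the next <div> sibling of the heading.
        body = heading.find_next_sibling('div')
        if body is not None:
            results.append(body.get_text(' ', strip=True))
    return results


sample = """<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>"""

for synopsis in synopses_from_html(sample):
    print(synopsis)
```

With a text file, pass the result of `open(path, encoding='utf-8').read()` as `html_text`, where `path` stands for wherever the file lives.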


Please look into using an XML/HTML parser in Python; I believe Python supports them natively. Parsing HTML with regular expressions is usually harmful. If you keep doing it, a lot of kittens will die.

I tried that, but I have a txt file rather than a link. I've been asked to use regex for this.
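On the point about built-in support: the standard library does ship `html.parser.HTMLParser`, which needs no extra install. A minimal sketch of using it for this task follows, assuming the same layout as the question's sample (a `Synopsis` `<h2>` followed by the `<div>` holding the text). The class name `SynopsisParser` and its attributes are made up for illustration.

```python
from html.parser import HTMLParser


class SynopsisParser(HTMLParser):
    """Collect the text of the <div> that follows each 'Synopsis' <h2>."""

    def __init__(self):
        super().__init__()
        self.synopses = []
        self._in_h2 = False
        self._heading = []
        self._expect_div = False  # saw a Synopsis heading, waiting for its <div>
        self._depth = 0           # nesting depth inside the synopsis <div>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_h2 = True
            self._heading = []
        elif tag == 'div':
            if self._depth:
                self._depth += 1          # nested <div> inside the synopsis
            elif self._expect_div:
                self._expect_div = False
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'h2' and self._in_h2:
            self._in_h2 = False
            self._expect_div = ''.join(self._heading).strip() == 'Synopsis'
        elif tag == 'div' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the collected text and collapse runs of whitespace.
                self.synopses.append(' '.join(' '.join(self._chunks).split()))

    def handle_data(self, data):
        if self._in_h2:
            self._heading.append(data)
        elif self._depth:
            self._chunks.append(data)


sample = """<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>"""

parser = SynopsisParser()
parser.feed(sample)
parser.close()
print(parser.synopses)
```

Unlike a non-greedy regex, the depth counter keeps going until the matching `</div>`, so a nested `<div>` inside a synopsis would not cut the text short.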