Python regular expression for parsing an HTML document

python html regex

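I'm trying to extract the names of these companies in order of revenue. This is a bit challenging because the cells that hold the names have tags in different formats. I'd be grateful if anyone could come up with a solution.

An example of my problem:

I want to match "Wal-Mart Stores, Inc." first, then "Sinopec Group", and so on, in order.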

<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>

…and further down in the file:

<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>

Thanks in advance.

This groups the content of the title attribute in the a tag. It also checks that the cell is the first table cell after the rank.

regex = /th>\n<td.*?><a .* ?title="(.*?)".*>/
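To try the pattern above from Python, here is a minimal sketch using the standard `re` module. The two table rows are hypothetical: they assume the rank sits in a `<th>` and the company cell starts on the very next line, which is what the pattern relies on. The extra industry cells are only there to show that later cells in a row are skipped.

```python
import re

# Hypothetical rows: rank in <th>, company cell on the following line.
html = (
    '<tr>\n<th>1</th>\n'
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td>\n'
    '<td>Retail</td>\n</tr>\n'
    '<tr>\n<th>2</th>\n'
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>\n'
    '<td><a href="/wiki/Petroleum" title="Petroleum">Petroleum</a></td>\n</tr>\n'
)

# Same pattern as above, without the /.../ delimiters.
# "." does not match a newline by default, so each match stays on one line.
pattern = re.compile(r'th>\n<td.*?><a .* ?title="(.*?)".*>')

# findall returns the captured title of each match, in document order.
names = pattern.findall(html)
print(names)
```

The "Petroleum" cell is not matched because it follows a `</td>` rather than a `th>`. If the real page puts the cell on the same line as the rank, or wraps it across lines, the pattern would need adjusting.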

This can be done with BeautifulSoup:

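from bs4 import BeautifulSoup

# Each list item is the HTML of one table cell.
cells = ['<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>', '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>']
# Parse each cell with the built-in parser and take the first <a> inside its <td>.
anchors = [BeautifulSoup(cell, 'html.parser').find('td').find('a') for cell in cells]
# Collect the title attribute of every anchor that has one.
titles = [a['title'].strip() for a in anchors if a is not None and a.has_attr('title')]
print(titles)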

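If it is a single string, you can use:
from bs4 import BeautifulSoup

html = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''
# Take the first <a> inside every <td> of the document.
anchors = [td.find('a') for td in BeautifulSoup(html, 'html.parser').find_all('td')]
# Skip cells without an anchor or without a title attribute.
titles = [a['title'].strip() for a in anchors if a is not None and a.has_attr('title')]
print(titles)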

If you still want to use a regular expression, then:
<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>
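For completeness, a minimal sketch of applying that pattern with Python's `re` module to the two cells from the question, joined into one string. The variable names are just for illustration.

```python
import re

html = (
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td> '
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'
)

# Capture the quoted value of the title attribute of the first <a> in each <td>.
pattern = re.compile(r'<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>')

# findall returns one captured title per matched cell, in document order.
titles = pattern.findall(html)
print(titles)
```

One caveat with this approach: if a `<td>` contains no `<a>`, the lazy `.*?` can run on into the next cell on the same line, so the result is worth checking against the real page.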

First, you probably don't need a regular expression. Second, it looks like they are all anchors with the class mw-redirect... something like BeautifulSoup should be able to select the items based on that…
I know I should use BeautifulSoup, but I need to use regular expressions.
Why not use the raw data?
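The idea in the first comment, selecting the anchors by their mw-redirect class, can be sketched without any third-party package using the standard library's `html.parser`. This is a swapped-in alternative to BeautifulSoup, not what the commenter used, and the class name `RedirectTitleParser` is made up for the example. It also assumes every company link really carries that class, which is true of the two cells in the question but may not hold for the whole page.

```python
from html.parser import HTMLParser


class RedirectTitleParser(HTMLParser):
    """Collect the title of every <a> with class mw-redirect, in document order."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attributes.get("class") or "").split()
        title = attributes.get("title")
        if "mw-redirect" in classes and title:
            self.titles.append(title.strip())


html = (
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td> '
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'
)

parser = RedirectTitleParser()
parser.feed(html)
parser.close()
print(parser.titles)
```

Because the parser walks the document from top to bottom, the titles come out in the same order as the rows of the table.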