Python: a regular expression to clean up web-scraped text


python regex python-3.x


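I am trying to extract some information from a Wikipedia page. I am using Beautiful Soup to load the text into Python, but I am struggling to strip out all the unnecessary markup with regular expressions.

Here is a sample of the text output from BeautifulSoup: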

[<td colspan="3">
</td>, <td valign="top" width="400">
<ul><li><a href="/wiki/Aach,_Baden-W%C3%BCrttemberg" title="Aach, Baden-Württemberg">Aach</a> (<a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>)</li>
<li><a href="/wiki/Aachen" title="Aachen">Aachen</a> (<a href="/wiki/North_Rhine-Westphalia" title="North Rhine-Westphalia">North Rhine-Westphalia</a>)</li>


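Ideally, I would like to end up with the city (the value of the title attribute) and the region (the text just before the end of the line).

Any help would be greatly appreciated.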

rows = soup.find_all('td')
list_rows = []

#remove html tags
for row in rows:
    cells = row.find_all('li')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = (re.sub(clean, '', str_cells))
    list_rows.append(clean2)
print(clean2)


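Here are two regular expressions that do what you want:

This regular expression seems to capture the title attribute of every town name, though it may need some tweaking if town names contain other special characters. As written, it allows spaces, dashes, and commas.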

title=\"([\w ,-]+)\">[\w]+[^\)]

You can test it.
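As a rough way to try the title pattern without an online tester, the sketch below runs it over the two sample rows from the question. It assumes the pattern is meant to read `title="([\w ,-]+)">\w+[^)]`; the names `SAMPLE_HTML` and `TITLE_PATTERN` are illustrative and not part of the original answer.

```python
import re

# The two sample rows from the question, copied as-is.
SAMPLE_HTML = """
<li><a href="/wiki/Aach,_Baden-W%C3%BCrttemberg" title="Aach, Baden-Württemberg">Aach</a> (<a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>)</li>
<li><a href="/wiki/Aachen" title="Aachen">Aachen</a> (<a href="/wiki/North_Rhine-Westphalia" title="North Rhine-Westphalia">North Rhine-Westphalia</a>)</li>
"""

# Assumed reading of the title pattern: capture the title attribute value,
# allowing word characters, spaces, commas, and dashes.
TITLE_PATTERN = re.compile(r'title="([\w ,-]+)">\w+[^)]')

# With a single capture group, findall returns just the captured titles.
titles = TITLE_PATTERN.findall(SAMPLE_HTML)
for title in titles:
    print(title)
```

On this sample the pattern also picks up the title attributes of the region links, so the city titles would still need to be separated from the region titles afterwards (here they alternate, city first).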

This one should capture the region name in its first capture group, though the same caveat about special characters applies:

([\w ,-]+)(</a>)?\)

You can test it as well.
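The region pattern can be tried the same way. This is a minimal sketch that assumes the pattern reads `([\w ,-]+)(</a>)?\)`, that is, a run of name characters, an optional closing anchor tag, then a closing parenthesis. The third row is hypothetical markup, added only to show the case where the region is plain text rather than a link; `SAMPLE_HTML` and `REGION_PATTERN` are illustrative names.

```python
import re

# Two rows from the question plus one hypothetical row (the last one)
# where the region is plain text instead of a hyperlink.
SAMPLE_HTML = """
<li><a href="/wiki/Aach,_Baden-W%C3%BCrttemberg" title="Aach, Baden-Württemberg">Aach</a> (<a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>)</li>
<li><a href="/wiki/Aachen" title="Aachen">Aachen</a> (<a href="/wiki/North_Rhine-Westphalia" title="North Rhine-Westphalia">North Rhine-Westphalia</a>)</li>
<li><a href="/wiki/Abenberg" title="Abenberg">Abenberg</a> (Bavaria)</li>
"""

# Assumed reading of the region pattern: group 1 is the region name,
# group 2 is the optional closing anchor tag before the ")".
REGION_PATTERN = re.compile(r'([\w ,-]+)(</a>)?\)')

# finditer lets us pick out only the first capture group of each match.
regions = [match.group(1) for match in REGION_PATTERN.finditer(SAMPLE_HTML)]
for region in regions:
    print(region)
```

One caveat with this reading: a city whose link text itself ends in a parenthesis, such as "Aken (Elbe)", would also produce a match, so the results may need filtering.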

You can use the .find_next_sibling() method in this case:

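import re
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Germany'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Each city is an <li> inside a <td width="400"> cell.
for li in soup.select('td[width="400"] li'):
    city = li.select_one('a')  # the first link in the item is the city
    region_link = city.find_next_sibling('a')
    if region_link:
        # The region is itself a hyperlink.
        region = region_link.text
    else:
        # The region is plain text right after the city's closing </a>.
        region = city.find_next_sibling(string=True).strip()
    # Drop the surrounding parentheses from the region before printing.
    print('{: <30}{}'.format(city.text, re.findall(r'[^()]+', region)[0]))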

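Can you share the URL of the Wikipedia page? Also, is the region the text just before the closing anchor tag?

Sure. In some cases the region is a hyperlink, but in others the region sits right before the closing anchor tag.

Thanks for the link. I will use it as a learning resource for regular expressions.

Thanks Andrej. Out of curiosity, is it better to use Beautiful Soup or regular expressions to clean up web-scraped text?

@JayDoe It is best to parse HTML with BeautifulSoup. Parsing HTML with regular expressions is rarely a good idea.

Just to confirm, the select, select_one, and find_next_sibling methods are all Beautiful Soup methods? Thanks, Jay

@JayDoe Yes, they are BeautifulSoup methods, and they are covered in its documentation. Make sure you are using the latest version!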

Output of the find_next_sibling() script:

Aach                          Baden-Württemberg
Aachen                        North Rhine-Westphalia
Aalen                         Baden-Württemberg
Abenberg                      Bavaria
Abensberg                     Bavaria
Achern                        Baden-Württemberg
Achim                         Lower Saxony
Adelsheim                     Baden-Württemberg
Adenau                        Rhineland-Palatinate
Adorf                         Saxony
Ahaus                         North Rhine-Westphalia
Ahlen                         North Rhine-Westphalia
Ahrensburg                    Schleswig-Holstein
Aichach                       Bavaria
Aichtal                       Baden-Württemberg
Aken (Elbe)                   Saxony-Anhalt
Albstadt                      Baden-Württemberg
Alfeld                        Lower Saxony
Allendorf (Lumda)             Hesse
Allstedt                      Saxony-Anhalt

...and so on.
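The sibling lookup can also be tried offline, without fetching the page. The sketch below is a hedged variation on the answer's approach: it parses a small inline snippet with the built-in `html.parser` backend instead of `lxml`, and wraps the loop in a helper named `extract_pairs`, which is a hypothetical name introduced here. The third row is hypothetical markup that stands in for entries whose region is plain text rather than a link.

```python
from bs4 import BeautifulSoup

# Two rows from the question plus one hypothetical plain-text-region row.
SAMPLE_HTML = """
<td valign="top" width="400">
<ul><li><a href="/wiki/Aach,_Baden-W%C3%BCrttemberg" title="Aach, Baden-Württemberg">Aach</a> (<a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>)</li>
<li><a href="/wiki/Aachen" title="Aachen">Aachen</a> (<a href="/wiki/North_Rhine-Westphalia" title="North Rhine-Westphalia">North Rhine-Westphalia</a>)</li>
<li><a href="/wiki/Abenberg" title="Abenberg">Abenberg</a> (Bavaria)</li>
</ul></td>
"""


def extract_pairs(html):
    """Return (city, region) tuples for each list item in the 400px cells."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for li in soup.select('td[width="400"] li'):
        city = li.select_one('a')
        if city is None:
            continue  # skip list items that contain no link at all
        region_link = city.find_next_sibling('a')
        if region_link is not None:
            # Region is a hyperlink following the city link.
            region = region_link.get_text()
        else:
            # Region is the text node that follows the city link.
            text = city.find_next_sibling(string=True)
            region = text if text else ''
        # Remove surrounding whitespace and parentheses, if any.
        pairs.append((city.get_text(), region.strip('() ')))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for city_name, region_name in pairs:
    print('{: <30}{}'.format(city_name, region_name))
```

Stripping the parentheses with `str.strip` rather than a regular expression keeps the example free of regex entirely, which matches the advice in the comments to prefer the parser for HTML.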