Python: re.findall() returns an empty list

I have the following code:

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    matches = findall(pattern, line)

print(matches)

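I have checked that the pattern is valid and that it matches strings in the HTML file. However, findall() still returns an empty list. Am I doing something wrong?

Edit: a mistake was pointed out and I fixed it. After running the code, the matches list is still empty.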

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    if findall(pattern, line) != []:
        matches.append(findall(pattern, line))

print(matches)

Here is code that reproduces the same problem. I hope this helps.

matches = []
with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

matches = findall("^\s\s<strong>.*</strong>$", content)

print(matches)

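I expect the lines you want to match contain other content as well. Your expression only accepts a line that consists of a single begin/end tag pair with content between the tags, and nothing before or after them on the same line. I bet you want the following expression instead: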

"\s\s<strong>.*?</strong>"
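
To make the difference concrete, here is a minimal sketch comparing the two patterns on a single line. The sample line is an assumption about what the markup might look like; the real spotifycharts.html may differ.

```python
import re

# Hypothetical sample line; the actual file contents are not shown in the question.
line = '  <strong>WAP (feat. Megan Thee Stallion)</strong> <span>by Cardi B</span>\n'

# Original pattern: anchored at both ends, so nothing may follow </strong>.
anchored = re.compile(r"^\s\s<strong>.*</strong>$")

# Suggested pattern: no anchors, and a non-greedy .*? so it stops at the first </strong>.
relaxed = re.compile(r"\s\s<strong>.*?</strong>")

anchored_matches = anchored.findall(line)
relaxed_matches = relaxed.findall(line)

print(anchored_matches)  # empty, because a <span> follows </strong> on the same line
print(relaxed_matches)   # the two whitespace characters plus the <strong> element
```

If the line really did end right after the closing tag, the anchored pattern would match too, so it is worth checking what actually follows the tag in the file.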

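Parsing HTML with regular expressions is like cleaning your teeth with a baseball bat. Baseball bats are fine tools, but they solve a different problem than dental scalers do.

Python has a third-party HTML parser called BeautifulSoup. Install it with

pip install beautifulsoup4

and then use it like this: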

>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> bs = BeautifulSoup(html, "html.parser")
>>> [e.text for e in bs.select(".chart-table-track strong")][:3]
['WAP (feat. Megan Thee Stallion)', 'Mood (feat. Iann Dior)', 'Head & Heart (feat. MNEK)']

Here we use the CSS selector ".chart-table-track strong" to extract all of the song titles (I assume that is the data you want...).
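
If installing a third-party package is not an option, the standard library's html.parser module can do a similar job with a bit more code. The following is only a sketch: the class name StrongTextParser and the sample markup are made up for illustration, and unlike the CSS selector above it collects every <strong> element rather than only those inside .chart-table-track.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collects the text found inside <strong> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside a <strong> element
        self._chunks = []  # text pieces of the element currently being read
        self.titles = []   # finished titles, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Assumed sample markup; the real page structure may differ.
sample = """
<td class="chart-table-track">
  <strong>Head &amp; Heart (feat. MNEK)</strong> <span>by Joel Corry</span>
</td>
"""

parser = StrongTextParser()
parser.feed(sample)
parser.close()
print(parser.titles)
```

A real parser also decodes entities such as &amp; for you, which a plain regular expression would leave untouched.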

Another approach is to use pandas:

>>> import pandas as pd
>>> import requests # not needed if you have html5lib
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> df = pd.read_html(html)[0]
>>> df[["Track", "Artist"]] = df["Track"].str.split("  by ", expand=True)
>>> df.drop(columns=df.columns[[0, 1, 2]])
                                Track  Streams      Artist
0     WAP (feat. Megan Thee Stallion)   311167     Cardi B
1              Mood (feat. Iann Dior)   295922    24kGoldn
2           Head & Heart (feat. MNEK)   190025  Joel Corry
3    Savage Love (Laxed - Siren Beat)   163776   Jawsh 685
4                         Breaking Me   150560       Topic
..                                ...      ...         ...
195                           Daisies    31092  Katy Perry
196                                21    31088      Polo G
197                     Nobody's Love    31047    Maroon 5
198        Ballin' (with Roddy Ricch)    30862     Mustard
199          Dancing in the Moonlight    30853   Toploader

[200 rows x 3 columns]

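You will only find matches if the last line contains one, because matches is overwritten on every iteration. Maybe you want to print inside the loop?

Welcome to SO! It looks like you are parsing HTML with a regular expression. Could you show the input HTML and the expected output? Thanks.

You are mixing re.compile and findall (without the re. prefix). It is generally best to share code that runs on its own (as @ggorlen says, a minimal, reproducible example). To get a useful answer you also need to provide a relevant HTML sample, or at least a link to a public page that is a good example of the HTML you are parsing.

If you are parsing HTML, I would recommend BeautifulSoup.

Are you really looking for lines that start with exactly two whitespace characters, followed by a begin/end <strong> tag pair with any amount of content inside? That is all your RE will match.

It is not built in; you need to install it through pip: pip install beautifulsoup4.

However, my assignment requires using only the default Python modules, which I don't think include BeautifulSoup.

Also, if you only need to do something simple and you don't already know BS, the baseball bat may be the expedient choice. I like BS and have used it a few times, but as a regex weenie I can see myself going either way.

BS syntax aside, I would just learn BS, or whatever the typical HTML parser is in your favorite language. It is well worth it, unless you are parsing HTML once and then giving up programming for life.

The pandas solution is nice, but it is not a standard library module either.

Yes. I was mostly interested in pointing him in the right direction. Actually, I wonder whether he has two pairs on the same line, in which case he would get the wrong behavior here too. Personally, I would go with r'\s*<strong>(.*?)</strong>', or even a longer pattern that adds \s*by (.*?) after the closing tag to capture the artist as well. (Actually, I would probably use BeautifulSoup, but that is beyond the scope of this post.)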
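
Pulling the comments together, a standard-library-only version of the loop might look like the sketch below. The content string is a hypothetical stand-in for the text read from spotifycharts.html, and the pattern is the capture-group variant suggested above. Note that iterating directly over a string (for line in content) yields single characters, not lines, which is another reason the original loop cannot match anything.

```python
import re

# Hypothetical stand-in for the text read from spotifycharts.html.
content = (
    '<td class="chart-table-track">\n'
    '  <strong>WAP (feat. Megan Thee Stallion)</strong> <span>by Cardi B</span>\n'
    '</td>\n'
    '<td class="chart-table-track">\n'
    '  <strong>Mood (feat. Iann Dior)</strong> <span>by 24kGoldn</span>\n'
    '</td>\n'
)

# The capture group makes findall return only the text between the tags.
pattern = re.compile(r"<strong>(.*?)</strong>")

matches = []
for line in content.splitlines():          # iterate over lines, not characters
    matches.extend(pattern.findall(line))  # accumulate instead of overwriting

print(matches)
```

Calling pattern.findall(content) on the whole string gives the same result here, since the pattern has no ^ or $ anchors that would need re.MULTILINE.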