Python: regex not working in bs4


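I am trying to extract links for a specific file host from the watchseriesfree.to website. In the example below I want the rapidvideo links, so I use a regex to keep only the tags that contain the text rapidvideo: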

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the code above returns an empty list. What am I doing wrong?

The problem is this line:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

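When BeautifulSoup applies your text regex pattern, it tests it against the .string attribute of each matched tr element. Now, .string has an important caveat: when an element has more than one child, .string is None. The documentation puts it this way:

"If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."

Each tr on the page has several children, so the pattern is never tested against any real text, and you get no results.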
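To see the caveat in isolation, here is a minimal sketch on made-up markup. The two table rows are assumptions for illustration, since the question does not show the real page's HTML. It prints .string for a row with one cell and for a row with two cells, then counts how many rows the regex filter finds.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: the real page's HTML is not shown in the question.
html = (
    "<table>"
    "<tr><td>rapidvideo.com</td></tr>"
    "<tr><td>rapidvideo.com</td><td><a href='/open/1'>Watch</a></td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# One <td> holding one string: .string resolves to that text.
print(rows[0].string)
# Two <td> children: .string is ambiguous, so it is None.
print(rows[1].string)

# The regex is tested against .string, so the two-cell row is skipped
# even though its text contains "rapidvideo".
matches = soup.find_all("tr", string=re.compile("rapidvideo"))
print(len(matches))
```

Only the single-cell row is matched, which is why a page whose rows all have several cells yields an empty list.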

What you can do instead is pass a search function and call .get_text() to check the actual text of each tr element.

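Note: in bs4, find_all is the preferred name for findAll. (The BS3-style name apparently still works, but I would update your code anyway.) The find_all signature also documents a string argument rather than text; text is the older name for the same argument. With a search function and the new method name, the lookup becomes: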

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
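As a rough sketch of how that lookup could be used end to end, the snippet below filters rows by their combined text and then collects the href of each link in the matching rows. The sample markup and the helper name links_for_host are hypothetical, and the case-insensitive regex is an added choice, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the episode page.
SAMPLE_HTML = (
    "<table>"
    "<tr><td>rapidvideo.com</td><td><a href='/open/1'>Watch</a></td></tr>"
    "<tr><td>openload.co</td><td><a href='/open/2'>Watch</a></td></tr>"
    "<tr><td>RapidVideo.com</td><td><a href='/open/3'>Watch</a></td></tr>"
    "</table>"
)


def links_for_host(html, host):
    """Return hrefs from <tr> rows whose combined text mentions `host`."""
    pattern = re.compile(re.escape(host), re.IGNORECASE)
    soup = BeautifulSoup(html, "html.parser")
    # get_text() joins the text of all descendants, so rows with several
    # cells are still tested against their full visible text.
    rows = soup.find_all(
        lambda tag: tag.name == "tr"
        and pattern.search(tag.get_text()) is not None
    )
    return [a["href"] for row in rows for a in row.find_all("a", href=True)]


print(links_for_host(SAMPLE_HTML, "rapidvideo"))
```

On the real site the hrefs would likely be relative, so they would still need to be joined with the site root, as the question's code does with head.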