Searching for href values in a web page using BS4 (Python, Beautiful Soup)


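I am working on a third-party application in which I have read the page source of a web page. From that source I need to collect only certain href values, namely those matching a pattern like /aems/file/filegetrevision.do?fileEntityId. Is that possible? My current attempt gives me all of the href values.

HTML (part of the HTML)

Thanks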

Yes, just use an appropriate filter for the href attribute. For example:

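def href_filter(href):
    # The value is None for <a> tags without an href, so guard before testing.
    return href is not None and '/aems/file/filegetrevision' in href

soup.find_all('a', href=href_filter)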
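To try the function filter end to end, here is a minimal sketch. The HTML snippet and the fileEntityId values are invented for illustration, since the markup from the question is not available; it assumes the bs4 package is installed and uses the standard-library html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page source from the question.
SAMPLE_HTML = """
<a href="/aems/file/filegetrevision.do?fileEntityId=1">rev 1</a>
<a href="/aems/other/page.do">other</a>
<a name="anchor-without-href">no href</a>
<a href="/aems/file/filegetrevision.do?fileEntityId=2">rev 2</a>
"""

def href_filter(href):
    # The attribute value is None for tags that lack href, so check first.
    return href is not None and '/aems/file/filegetrevision' in href

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Keep only the href strings of the links accepted by the filter.
matching_hrefs = [a['href'] for a in soup.find_all('a', href=href_filter)]
print(matching_hrefs)
```

The None check matters because the sample contains an anchor with no href at all; without it the substring test would raise a TypeError on that tag.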

Besides a function, you can also use a RegexObject as the filter:

import re
pattern = re.compile(some_regular_expression)
soup.find_all('a', href=pattern)
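As a concrete version of the regex filter, the sketch below uses an invented HTML snippet and one possible regular expression for the pattern in the question; neither comes from the original post. The expression is anchored with ^ so that it behaves the same whether the library applies the pattern from the start of the string or searches anywhere in it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real page source was not included in the question.
SAMPLE_HTML = """
<a href="/aems/file/filegetrevision.do?fileEntityId=1">rev 1</a>
<a href="/aems/other/page.do">other</a>
<a name="anchor-without-href">no href</a>
<a href="/aems/file/filegetrevision.do?fileEntityId=2">rev 2</a>
"""

# Escape the literal dot and question mark; anchor at the start of the value.
pattern = re.compile(r'^/aems/file/filegetrevision\.do\?fileEntityId=')

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A compiled pattern can be passed directly as the attribute filter.
regex_hrefs = [a['href'] for a in soup.find_all('a', href=pattern)]
print(regex_hrefs)
```

Compared with the substring check, a regex makes it easy to tighten the match later, for example by requiring a numeric fileEntityId.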

See the documentation.

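@CRUSADER Yes, I tried, but without success. Please see the above!

In this case I would probably use href.startswith("..."). Shouldn't the regex example be re.compile("...").match or partial(re.match, "...")?

@JonClements No need for that, BS groks RegexObjects as well as callables. +1 for startswith; I'm not sure exactly which filter the OP needs, but it may come in handy.

You're right (for some reason I thought it didn't do that; I may be confusing it with another library): "If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its match() method." (It might be worth adding that to your answer for clarity.) +1 anyway.

@Kos I want to pull all href values out of the web page based on the pattern /aems/file/filegetrevision.do?fileEntityId. I think the answer you gave is the right one. :)

@VBSlover If all of the URLs start with that pattern, then using href.startswith(pattern), as @JonClements suggested, will be faster than pattern in href, if that makes a difference.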
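The startswith variant suggested in the comments could look like the sketch below. The sample HTML, the PREFIX constant, and the function names are illustrative assumptions. The sample includes one link that contains the pattern without starting with it, to show where the two filters differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the third link contains the pattern but has a prefix.
SAMPLE_HTML = """
<a href="/aems/file/filegetrevision.do?fileEntityId=1">rev 1</a>
<a name="anchor-without-href">no href</a>
<a href="/mirror/aems/file/filegetrevision.do?fileEntityId=3">mirror</a>
"""

PREFIX = '/aems/file/filegetrevision.do?fileEntityId'

def starts_with_prefix(href):
    # Accept only values that begin with the prefix; None means no href.
    return href is not None and href.startswith(PREFIX)

def contains_prefix(href):
    # Accept values that contain the prefix anywhere.
    return href is not None and PREFIX in href

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
prefix_hrefs = [a['href'] for a in soup.find_all('a', href=starts_with_prefix)]
substring_hrefs = [a['href'] for a in soup.find_all('a', href=contains_prefix)]
print(prefix_hrefs)
print(substring_hrefs)
```

Besides any speed difference, startswith is the stricter test, so it is the better fit only when every wanted URL really begins with the pattern.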