Python: using the startswith function to filter a list of URLs

Tags: python, web-scraping, beautifulsoup, startswith

I have the following piece of code that extracts all the links from a page and puts them in a list (
links = []
), which is then passed to the function
filter_links()
. I want to filter out any link that is not on the same domain as the starting link, i.e. the first link in the list. This is what I have:

import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])


def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links


print(filter_links(links))
I used the built-in startswith function, but it filters out everything except the starting URL. Eventually I want to pass several different starting URLs through this program, so I need a generic way of keeping only the URLs that are on the same domain as the starting URL. I think I could use regex, but shouldn't this function work as well?

Try this:

import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links


print(filter_links(links))
Note

  • You need to move the return statement out of the for loop. It returned after iterating over a single element, so only the first item in the list was ever considered.
  • The
    tldextract
    module does a better job of extracting the domain name from a URL. Whether you still want to explicitly check that a link starts with
    links[0]
    is up to you.
  • Output

    ['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 
'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
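If you would rather avoid the extra tldextract dependency, a standard-library-only sketch of the same idea compares the host part of each URL via urllib.parse; the helper names and sample URLs below are illustrative, not from the original code:

```python
from urllib.parse import urlparse

def strip_www(host):
    # drop a leading "www." so www.example.org matches example.org
    return host[4:] if host.startswith("www.") else host

def filter_links(links, base_url):
    # keep only absolute links whose host matches the starting URL's host
    base_host = strip_www(urlparse(base_url).netloc.lower())
    filtered = []
    for link in links:
        host = strip_www(urlparse(link).netloc.lower())
        if host == base_host:
            filtered.append(link)
    return filtered

links = [
    "http://enzymebiosystems.org/contact-us/",
    "http://www.enzymebiosystems.org/leadership/about/",
    "https://twitter.com/share",
    "/relative/path",
]
print(filter_links(links, "http://www.enzymebiosystems.org/"))
```

Note that relative links (empty host) are dropped here; if you want to keep them, you would need to resolve them against the base URL first, e.g. with urllib.parse.urljoin.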
    
    Possible solution: what if you keep all links that "contain" the domain?

    For example:

    import pandas as pd
    
    links = []
    for tag in soup.find_all('a', href=True):
        links.append(tag['href'])
    
    all_links = pd.DataFrame(links, columns=["Links"])
    enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
    
    # results in a dataframe with links containing "enzymebiosystems". 
    

    If you want to search for multiple domains, adjust the filter accordingly.
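    For the multiple-domain case, one possible sketch joins the domain strings into a single regex pattern for str.contains; the domain list and URLs below are made up for illustration:

```python
import pandas as pd

links = [
    "http://enzymebiosystems.org/contact-us/",
    "https://twitter.com/share",
    "https://example.com/page",
]
all_links = pd.DataFrame(links, columns=["Links"])

# keep rows whose URL contains any of the target domain strings
domains = ["enzymebiosystems", "example"]
pattern = "|".join(domains)  # "enzymebiosystems|example", a regex alternation
matched = all_links[all_links.Links.str.contains(pattern)]
print(matched)
```

    str.contains treats the pattern as a regular expression by default, so the "|" alternation matches any of the listed domains.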

    OK, so you have an indentation error in
    filter_links(links)
    . The function should look like this:

    def filter_links(links):
        filtered_links = []
        for link in links:
            if link.startswith(links[0]):
                filtered_links.append(link)
        return filtered_links
    
    Note that in your code you left the return statement inside the for loop, so the loop runs once and then the list is returned.


    Hope this helps :)


    Print your links - most likely your initial link is not the root of the scraped website, but has some parameters in it that don't match any other link.