Python: using the startswith function to filter a list of URLs

Tags: python, web-scraping, beautifulsoup, startswith

I have the following piece of code that extracts all the links from a page and puts them in a list (
links = []
), which is then passed to the function
filter_links()
. I want to filter out any link that is not on the same domain as the starting link, i.e. the first link in the list. This is what I have:

import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])


def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links


print(filter_links(links))
I used the built-in startswith function, but it filters out everything except the starting URL. Eventually I want to pass several different starting URLs through this program, so I need a generic way of keeping only the URLs that are on the same domain as the starting URL. I think I could use regex, but shouldn't this function work as well?

Try this:

import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links


print(filter_links(links))
Note

  • You need to move the return statement out of the for loop. It returned after iterating over a single element, so only the first item in the list was ever considered.
  • The
    tldextract
    module does a better job of extracting the domain name from a URL. Whether you still want to explicitly check that a link starts with
    links[0]
    is up to you.
  • Output

    ['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 
'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
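If you would rather avoid the extra tldextract dependency, a standard-library-only sketch of the same idea compares the host part of each URL via urllib.parse; the helper names and sample URLs below are illustrative, not from the original code:

```python
from urllib.parse import urlparse

def strip_www(host):
    # drop a leading "www." so www.example.org matches example.org
    return host[4:] if host.startswith("www.") else host

def filter_links(links, base_url):
    # keep only absolute links whose host matches the starting URL's host
    base_host = strip_www(urlparse(base_url).netloc.lower())
    filtered = []
    for link in links:
        host = strip_www(urlparse(link).netloc.lower())
        if host == base_host:
            filtered.append(link)
    return filtered

links = [
    "http://enzymebiosystems.org/contact-us/",
    "http://www.enzymebiosystems.org/leadership/about/",
    "https://twitter.com/share",
    "/relative/path",
]
print(filter_links(links, "http://www.enzymebiosystems.org/"))
```

Note that relative links (empty host) are dropped here; if you want to keep them, you would need to resolve them against the base URL first, e.g. with urllib.parse.urljoin.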
    
    Possible solution: what if you keep all links that "contain" the domain?

    For example:

    import pandas as pd
    
    links = []
    for tag in soup.find_all('a', href=True):
        links.append(tag['href'])
    
    all_links = pd.DataFrame(links, columns=["Links"])
    enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
    
    # results in a dataframe with links containing "enzymebiosystems". 
    

    If you want to search for multiple domains, adjust the filter accordingly.
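    For the multiple-domain case, one possible sketch joins the domain strings into a single regex pattern for str.contains; the domain list and URLs below are made up for illustration:

```python
import pandas as pd

links = [
    "http://enzymebiosystems.org/contact-us/",
    "https://twitter.com/share",
    "https://example.com/page",
]
all_links = pd.DataFrame(links, columns=["Links"])

# keep rows whose URL contains any of the target domain strings
domains = ["enzymebiosystems", "example"]
pattern = "|".join(domains)  # "enzymebiosystems|example", a regex alternation
matched = all_links[all_links.Links.str.contains(pattern)]
print(matched)
```

    str.contains treats the pattern as a regular expression by default, so the "|" alternation matches any of the listed domains.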

    OK, so you have an indentation error in
    filter_links(links)
    . The function should look like this:

    def filter_links(links):
        filtered_links = []
        for link in links:
            if link.startswith(links[0]):
                filtered_links.append(link)
        return filtered_links
    
    Note that in your code you left the return statement inside the for loop, so the loop runs once and then the list is returned.


    Hope this helps :)


    Print your links - most likely your initial link is not the root of the scraped website, but has some parameters in it that don't match any other link.