How can I use Python to get all the links of a web page, whether they link to it directly or indirectly?


Given a website's homepage URL, I need to get all the related links: every link that appears on the homepage, plus any new links that can be reached by following the links found on the homepage.

I am using the BeautifulSoup Python library, and I am also considering Scrapy. The code below only extracts the links found on the homepage itself:

from bs4 import BeautifulSoup
import requests


url = "https://www.dataquest.io"

def links(url):
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, 'lxml')

    links = bsObj.findAll('a')
    finalLinks = set()
    for link in links:
        finalLinks.add(link)

    return finalLinks

print(links(url))
linklis = list(links(url))

for l in linklis:
    print(l)
    print("\n")


I need a list of all the URLs/links that can be reached from the homepage URL (whether they are linked to it directly or indirectly).

This script will print all the links found on https://www.dataquest.io:

from bs4 import BeautifulSoup
import requests

url = "https://www.dataquest.io"

def links(url):
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, 'lxml')

    # Select only anchor tags that actually have an href attribute
    links = bsObj.select('a[href]')

    final_links = set()

    for link in links:
        url_string = link['href'].rstrip('/')
        # Skip javascript: pseudo-links and in-page fragment anchors
        if 'javascript:' in url_string or url_string.startswith('#'):
            continue
        # Turn relative paths into absolute URLs on the same site
        elif 'http' not in url_string and not url_string.startswith('//'):
            url_string = 'https://www.dataquest.io' + url_string
        # Skip links that point to other domains
        elif 'dataquest.io' not in url_string:
            continue
        final_links.add(url_string)

    return final_links

for l in sorted(links(url)):
    print(l)
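The string concatenation above assumes every relative `href` begins with `/`. As a more general alternative, the standard library's `urllib.parse` can resolve relative and protocol-relative links against the base URL. A minimal sketch (the helper name `normalize_link` is my own, not from the answer):

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.dataquest.io"

def normalize_link(href, base=BASE):
    """Resolve href against the base URL; return None for links we skip."""
    if href.startswith('#') or href.startswith('javascript:'):
        return None
    # urljoin handles relative paths and protocol-relative ('//...') hrefs
    absolute = urljoin(base, href)
    if urlparse(absolute).netloc.endswith('dataquest.io'):
        return absolute.rstrip('/')
    return None  # external link, skip it

print(normalize_link('/blog/'))    # https://www.dataquest.io/blog
print(normalize_link('#section'))  # None
```

This keeps the same filtering behavior as the loop above while avoiding hard-coded prefix logic.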
Prints:

http://app.dataquest.io/login
http://app.dataquest.io/signup
https://app.dataquest.io/signup
https://www.dataquest.io
https://www.dataquest.io/about-us
https://www.dataquest.io/blog
https://www.dataquest.io/blog/learn-data-science
https://www.dataquest.io/blog/learn-python-the-right-way
https://www.dataquest.io/blog/the-perfect-data-science-learning-tool
https://www.dataquest.io/blog/topics/student-stories
https://www.dataquest.io/chat
https://www.dataquest.io/course
https://www.dataquest.io/course/algorithms-and-data-structures
https://www.dataquest.io/course/apis-and-scraping
https://www.dataquest.io/course/building-a-data-pipeline
https://www.dataquest.io/course/calculus-for-machine-learning
https://www.dataquest.io/course/command-line-elements
https://www.dataquest.io/course/command-line-intermediate
https://www.dataquest.io/course/data-exploration
https://www.dataquest.io/course/data-structures-algorithms
https://www.dataquest.io/course/decision-trees
https://www.dataquest.io/course/deep-learning-fundamentals
https://www.dataquest.io/course/exploratory-data-visualization
https://www.dataquest.io/course/exploring-topics
https://www.dataquest.io/course/git-and-vcs
https://www.dataquest.io/course/improving-code-performance
https://www.dataquest.io/course/intermediate-r-programming
https://www.dataquest.io/course/intro-to-r
https://www.dataquest.io/course/kaggle-fundamentals
https://www.dataquest.io/course/linear-algebra-for-machine-learning
https://www.dataquest.io/course/linear-regression-for-machine-learning
https://www.dataquest.io/course/machine-learning-fundamentals
https://www.dataquest.io/course/machine-learning-intermediate
https://www.dataquest.io/course/machine-learning-project
https://www.dataquest.io/course/natural-language-processing
https://www.dataquest.io/course/optimizing-postgres-databases-data-engineering
https://www.dataquest.io/course/pandas-fundamentals
https://www.dataquest.io/course/pandas-large-datasets
https://www.dataquest.io/course/postgres-for-data-engineers
https://www.dataquest.io/course/probability-fundamentals
https://www.dataquest.io/course/probability-statistics-intermediate
https://www.dataquest.io/course/python-data-cleaning-advanced
https://www.dataquest.io/course/python-datacleaning
https://www.dataquest.io/course/python-for-data-science-fundamentals
https://www.dataquest.io/course/python-for-data-science-intermediate
https://www.dataquest.io/course/python-programming-advanced
https://www.dataquest.io/course/r-data-cleaning
https://www.dataquest.io/course/r-data-cleaning-advanced
https://www.dataquest.io/course/r-data-viz
https://www.dataquest.io/course/recursion-and-tree-structures
https://www.dataquest.io/course/spark-map-reduce
https://www.dataquest.io/course/sql-databases-advanced
https://www.dataquest.io/course/sql-fundamentals
https://www.dataquest.io/course/sql-fundamentals-r
https://www.dataquest.io/course/sql-intermediate-r
https://www.dataquest.io/course/sql-joins-relations
https://www.dataquest.io/course/statistics-fundamentals
https://www.dataquest.io/course/statistics-intermediate
https://www.dataquest.io/course/storytelling-data-visualization
https://www.dataquest.io/course/text-processing-cli
https://www.dataquest.io/directory
https://www.dataquest.io/forum
https://www.dataquest.io/help
https://www.dataquest.io/path/data-analyst
https://www.dataquest.io/path/data-analyst-r
https://www.dataquest.io/path/data-engineer
https://www.dataquest.io/path/data-scientist
https://www.dataquest.io/privacy
https://www.dataquest.io/subscribe
https://www.dataquest.io/terms
https://www.dataquest.io/were-hiring
https://www.dataquest.io/wp-content/uploads/2019/03/db.png
https://www.dataquest.io/wp-content/uploads/2019/03/home-code-1.jpg
https://www.dataquest.io/wp-content/uploads/2019/03/python.png
EDIT: Changed the selector to `a[href]` so that only anchors that actually have an `href` attribute are selected.

EDIT 2: A basic recursive crawler:

def crawl(urls, seen=None):
    # Avoid a mutable default argument: a shared `seen` set would
    # persist across separate top-level calls to crawl().
    if seen is None:
        seen = set()
    for url in urls:
        if url not in seen:
            print(url)
            seen.add(url)
            new_links = links(url)
            crawl(urls.union(new_links), seen)

starting_links = links(url)
crawl(starting_links)
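The recursion above has no stopping condition other than running out of unseen links, and deep sites can exhaust Python's recursion limit. An iterative breadth-first version with a depth limit is one way to bound the crawl. This is a sketch under my own naming (`crawl_bfs`, `get_links`); the link-extraction function is passed in as a parameter so it can be, for example, the `links()` function above:

```python
from collections import deque

def crawl_bfs(start_url, get_links, max_depth=2):
    """Breadth-first crawl that stops expanding pages beyond max_depth.

    get_links(url) should return an iterable of links found on that page
    (e.g. the links() function above); injecting it keeps this sketch
    testable without network access.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # reached the depth limit, do not expand this page
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Example with a fake site graph instead of real HTTP requests:
site = {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': ['e'], 'e': []}
print(crawl_bfs('a', lambda u: site.get(u, []), max_depth=2))
```

Breadth-first order also means every page is discovered via its shortest link path from the homepage, which makes the depth limit meaningful.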

Hi, first of all, thanks. I would like to know whether this recursively checks for new links while visiting the previously found links.

@Rink16 To check the links, call the `links()` function on each URL in the list. But you need to set some stopping condition to break out of the loop.

Does Scrapy do the same as the code above, or will it crawl all the links on the homepage as well as the links on the linked pages?

@Rink16 I edited my answer and added a raw crawler. Indeed, Scrapy would be the better solution.

@Rink16 This is just an example... you can adapt and extend it for your use case. But for crawling, I would look at `Scrapy`, which has crawling functionality built in.