Python: how to recursively find all the links on a web page using BeautifulSoup?


I have been trying to recursively find all the links for a given URL, using some code I found:

import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url,depth):

    if depth == 5:
        return url
    else:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a') #find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink,depth+1)


def getLinks(url):
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link,0))
    return links

links = getLinks(url)
print(links)
In addition to this warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")
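
For this script, that would mean naming the parser explicitly in the call the warning points at, for example:

 soup = BeautifulSoup(page.read(), "lxml")  # or "html.parser" if lxml is not installed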
I get the following error:

Traceback (most recent call last):
  File "downloader.py", line 28, in <module>
    links = getLinks(url)
  File "downloader.py", line 25, in getLinks
    links.append(recursiveUrl(link,0))
  File "downloader.py", line 11, in recursiveUrl
    page=urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
TypeError: 'NoneType' object is not callable

What is the problem?

Your recursiveUrl tries to open an invalid URL such as /webpage/category/general, which is a value extracted from one of the href attributes.

You should append the extracted href value to the website's URL and then try to open that page (one way to do this is sketched after the output below). You will also need to work on your recursion algorithm, since I am not sure exactly what you want to achieve.

Code:

Output:

http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/10
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/09
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/08
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/07
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
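
To turn a relative href such as /webpage/category/general into a full URL as suggested above, the standard library's urljoin can be used. A minimal sketch (not part of the answer's original code):

from urllib.parse import urljoin   # Python 3; in Python 2 this lives in urlparse

base = "http://francaisauthentique.libsyn.com/"
href = "/webpage/category/general"   # value taken from a tag's href attribute
print(urljoin(base, href))           # http://francaisauthentique.libsyn.com/webpage/category/general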

This code will recursively go to every link and keep appending the full URLs to a list. The final output will be a set of URLs:

import requests
from bs4 import BeautifulSoup

listUrl = []

def recursiveUrl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    if links is None or len(links) == 0:
        # no outgoing links: record this URL and stop recursing
        listUrl.append(url)
        print(url)
        return 1
    else:
        listUrl.append(url)
        print(url)
        for link in links:
            # build the next URL from the page URL and the relative href
            #print(url+link['href'][1:])
            recursiveUrl(url+link['href'][1:])


recursiveUrl('http://target.com')
print(listUrl)
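
Note that the code above never records which pages it has already visited, so on a site whose pages link back to each other it will recurse until Python's recursion limit is hit, and url+link['href'][1:] only works for hrefs that start with '/'. A variant with a visited set, a depth limit, and urljoin, shown here only as an illustrative sketch rather than the answerer's code:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

visited = set()

def crawl(url, depth=0, max_depth=5):
    # skip URLs we have already seen and stop past the depth limit
    if url in visited or depth > max_depth:
        return
    visited.add(url)
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')  # check the attribute exists first
        if href and href.startswith(('http://', 'https://', '/')):
            # urljoin resolves relative hrefs against the current page URL
            crawl(urljoin(url, href), depth + 1, max_depth)

crawl('http://francaisauthentique.libsyn.com/')
print(sorted(visited))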

I think you are passing a BeautifulSoup object to urlopen instead of a URL. Try something like link['href'], but be sure to check that it exists first.
Thanks Thomas, but now I get the error "ValueError: unknown url type: /webpage/category/general". Could that be because it is a relative link rather than an absolute one?
@Alex Correct :) Have you considered just using scrapy, by the way? It does all the legwork of following links down to some depth limit, restricting the crawl to certain URLs, and so on, for you.
I tried wget, and it did not download a single file.
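
As for the scrapy suggestion in that comment thread, a CrawlSpider can follow links down to a configurable depth. A minimal sketch, where the spider name, depth limit, and output format are assumptions rather than anything from the discussion above:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class LinkSpider(CrawlSpider):
    name = 'links'                            # hypothetical spider name
    start_urls = ['http://francaisauthentique.libsyn.com/']
    custom_settings = {'DEPTH_LIMIT': 5}      # stop following links past this depth
    # follow every link found and report each page that is reached
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {'url': response.url}

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(LinkSpider)
    process.start()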