How to create a recursive loop in Python

python recursion web-scraping


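I'm trying to build a web scraper that uses BeautifulSoup to loop through web pages.

To do that, I tried writing a function that requests the page I'm looking at, finds the href of the Next button, prints the result, then feeds it back into the request and recursively repeats the function, printing each new value of the Next button.

This is what I have, and I really can't tell why it isn't working. I get no errors, so I think my structure may be wrong.

Thanks in advance.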

import urllib.request
from bs4 import BeautifulSoup
import re

url = "http://www.calaiswine.co.uk/products/type/all-wines/1.aspx"
root_url = "http://www.calaiswine.co.uk"
first_index_url =  '/products/type/all-wines/1.aspx'

htmlFile = urllib.request.urlopen(url);

htmlText = htmlFile.read();

soup = BeautifulSoup(htmlText);

def cycle_to_next_page(foo):
    response = urllib.request.urlopen( root_url + foo)
    soup = BeautifulSoup(response)
    items = [a.attrs.get('href') for a in soup.findAll('a', title='Next')]
    print (cycle_to_next_page(items[0]))

cycle_to_next_page(first_index_url)

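Your recursive function doesn't return anything; it only prints.

In Python, a function that doesn't return anything is treated as returning None. Every print(cycle_to_next_page(items[0])) therefore waits for the inner call to finish and then receives None, so Python handles your cycle_to_next_page(first_index_url) instruction as if it were:

print(print(None))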
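To make the None behaviour concrete, here is a minimal offline sketch. The two countdown functions are hypothetical stand-ins for the scraper, not code from the question: the first wraps the recursive call in print() the way the question does, the second returns the inner result instead.

```python
def countdown_printing(n):
    """Recurse the way the question does: wrap the recursive call in print()."""
    if n == 0:
        return  # bare return, so the caller receives None
    # The inner call must finish first; print() then receives its None.
    print(countdown_printing(n - 1))


def countdown_returning(n):
    """Same recursion, but every level hands the inner result back up."""
    if n == 0:
        return "done"
    return countdown_returning(n - 1)


value = countdown_printing(3)   # prints None once per recursion level
final = countdown_returning(3)
print(value, final)
```

Only the version that returns the recursive call gives the outermost caller something other than None.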

Personally, I wouldn't use recursion for this example. A basic loop that iterates over the items is enough.
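The loop-based approach could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: crawl, next_path_from_site, fake_site, and the max_pages cap are hypothetical names introduced here, and the link lookup is passed in as a callable so the loop can be tried without touching the network.

```python
def crawl(start_path, get_next_path, max_pages=100):
    """Follow Next links with a plain loop and return every path visited.

    get_next_path is a hypothetical callable: given a path, it returns the
    href of that page's Next link, or None when there is none.
    """
    visited = []
    path = start_path
    # Stop on a missing link, a page already seen, or the safety cap.
    while path and path not in visited and len(visited) < max_pages:
        visited.append(path)
        path = get_next_path(path)
    return visited


def next_path_from_site(path, root_url="http://www.calaiswine.co.uk"):
    """Possible real lookup, reusing the question's URL and Next-link selector."""
    # Imported here so the offline demo below runs without bs4 installed.
    import urllib.request
    from bs4 import BeautifulSoup

    with urllib.request.urlopen(root_url + path) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    link = soup.find("a", title="Next")
    return link.get("href") if link else None


# Stand-in for the real site: three pages, the last one has no Next link.
fake_site = {
    "/products/type/all-wines/1.aspx": "/products/type/all-wines/2.aspx",
    "/products/type/all-wines/2.aspx": "/products/type/all-wines/3.aspx",
    "/products/type/all-wines/3.aspx": None,
}

pages = crawl("/products/type/all-wines/1.aspx", fake_site.get)
for page in pages:
    print(page)
```

Against the live site the call would be crawl(first_index_url, next_path_from_site), assuming the pages still mark the link with title="Next".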

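Remove your print, as @Jivan explained, so that the function actually calls itself recursively. You also don't need to repeat the first urllib.urlopen call; the same function can open the initial page as well. Something like this: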
import urllib.request
from bs4 import BeautifulSoup

root_url = "http://www.calaiswine.co.uk"
first_index_url = '/products/type/all-wines/1.aspx'


def cycle_to_next_page(link):
    response = urllib.request.urlopen(root_url + link)
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.read(), 'html.parser')
    # bs4 spells this find_all (findAll is the legacy name)
    items = [a.attrs.get('href') for a in soup.find_all('a', title="Next")]
    # base case: no usable Next link means this is the last page
    if not items or not items[0]:
        print("crawling completed")
        return
    print(items[0])
    # here is the recursive function call, do a proper return, not print
    return cycle_to_next_page(items[0])

# you can open your initial page with this function too
cycle_to_next_page(first_index_url)

#results:
/products/type/all-wines/2.aspx
/products/type/all-wines/3.aspx
/products/type/all-wines/4.aspx
...

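Also, I'm not sure whether you only need items[0] or all of the items; either way, you can change the logic and call the function accordingly.
Hope this helps.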

How does your recursive loop terminate?
What language are you trying to write? There shouldn't be any ; in your code.
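On the termination question: a recursive crawler needs an explicit base case, and CPython also caps recursion depth (the default limit is usually 1000), so a long or circular chain of Next links would end in a RecursionError rather than a clean stop. The sketch below shows one way to guard against that. The names cycle, seen, max_depth, and the links table with its short paths are hypothetical and used only for illustration.

```python
import sys


def cycle(path, get_next_path, seen=None, max_depth=500):
    """Recursive walk that stops on a missing link, a repeated page, or max_depth."""
    seen = [] if seen is None else seen
    if path is None or path in seen or len(seen) >= max_depth:
        return seen  # base case: hand the collected paths back up the stack
    seen.append(path)
    return cycle(get_next_path(path), get_next_path, seen, max_depth)


# Made-up site whose last page links back to the first one.
links = {"/1.aspx": "/2.aspx", "/2.aspx": "/3.aspx", "/3.aspx": "/1.aspx"}

print(cycle("/1.aspx", links.get))
# The interpreter's own ceiling; max_depth should stay well below it.
print(sys.getrecursionlimit())
```

Without the seen check, the circular table above would recurse until max_depth is reached; without either guard it would hit the interpreter's limit.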