
Python: looping over a list to get the results of each page (subpage)

Tags: python, loops, for-loop, web-scraping, beautifulsoup

I am trying to get the number of pages for each URL from a list of URLs. My code works as long as I only have a single URL, but as soon as I try it with a list of URLs I only get the result for one URL and not for the rest. I guess the problem has to do with my loop. Given that I am still new to Python and BeautifulSoup, I have not been able to spot the mistake myself.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.holidaycheck.de'
main_page = 'https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1?p={}'
urls=[]

##Change URL into object (soup)
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html5lib")

#get max page number
soup = BeautifulSoup(r.text, 'lxml')
data = soup.find_all('a', {'class':'link'})

res = []
for i in data:
    res.append(i.text) #writing each value to res list

res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except:
        print("current value is not a number")
last_page=max(res_int)
#print(last_page)


for i in range (1,last_page):
    page = main_page.format(i)
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
       urls = base_url + link.find('a').get('href')+"/-/p/{}"
       print(urls)
Up to this point everything works: I get the maximum page number and I collect the URLs of all pages. The problem lies in the code below (I believe):

I am trying to apply the same concept as in the code above to each URL in the list urls, but instead of getting the maximum page number for each of the 241 URLs, it seems as if I am stuck in a loop.

Any ideas? Your help is very much appreciated.

You are setting urls equal to the last link generated by the loop. To build a valid list of URLs, you need to replace the = assignment with append().
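In other words, the inner loop from the question would collect the links like this (a minimal sketch reusing the names from the question's code):

urls = []  # start from an empty list
for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
    # append each generated link instead of overwriting urls
    urls.append(base_url + link.find('a').get('href') + "/-/p/{}")

With the original = assignment, urls ends up holding a single string, so a later for url in urls: iterates over its characters rather than over links, which matches the "stuck in a loop" symptom.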

EDIT: OK, as far as I can see there are several problems in your code. Besides my initial fix, I am outlining below my view and understanding of how your code is supposed to work:

import requests
from bs4 import BeautifulSoup
base_url = 'https://www.holidaycheck.de'
main_page = 'https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1?p={}'

##Change URL into object (soup)
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html5lib")

#get max page number
soup = BeautifulSoup(r.text, 'lxml')
data = soup.find_all('a', {'class':'link'})

res = []
for i in data:
    res.append(i.text) #writing each value to res list

res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except:
        print("current value is not a number")
last_page=max(res_int)
#print(last_page)


urls = []
for i in range (1,last_page):
    page = main_page.format(i)
    r = requests.get(page) #these 2 rows added
    soup = BeautifulSoup(r.text, 'lxml') #these 2 rows added
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
        try: #also adding try-except for escaping broken/unavailable links
            urls.append(base_url + link.find('a').get('href')+"/-/p/{}")
        except:
            print('no link available', i)

urls = list(set(urls)) #check and drop duplicated in links list

for url in urls: #to loop through the list of urls
    try:
        r = requests.get(url.format(0))
        print(url.format(0))
        soup = BeautifulSoup(r.text, 'lxml')
        daten = soup.find_all('a', {'class':'link'})
    except:
        print('broken link')
        continue #skip this url so that daten from a previous page is not reused

    tes = []
    for z in daten:
        tes.append(z.text) #writing each value to res list
#    print(tes)

    tes_int = []
    for z in tes:
        try:
            tes_int.append(int(z))
        except:
            print("current value is not a number")
    try:
        anzahl=max(tes_int)
        print(anzahl)
    except:
        print('maximum cannot be calculated')
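Once anzahl is known for a hotel, the next step would be to build the concrete review-page URLs from the same "/-/p/{}" placeholder. This is only a sketch of that follow-up (not part of the original answer); it assumes the pages are numbered from 0, like the ?p={} parameter used above, and it would sit inside the for url in urls: loop:

for p in range(anzahl):  # may need anzahl + 1, depending on how the site counts pages
    page_url = url.format(p)  # fills the "/-/p/{}" placeholder with the page number
    print(page_url)  # fetch and parse each review page here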

Check the value of urls, and then of url inside the for url in urls: loop, I think you will be surprised. You did not define base_url. Also fixed the indentation in your second code block.

@B I checked the URLs and they look fine. I also randomly opened links from the list and they work. Isn't that the case?

No, you didn't. The variable urls at the start of the loop contains a single URL. See Dmitriy Fialkovskiy's answer.

Don't forget to initialize urls. Better yet, replace the for loop with a list comprehension (a sketch of this appears after the last code block below).

urls is initialized in the question code above, but it makes sense to add it here as well. Added urls.

OK, I fixed the code according to @DmitriyFialkovskiy's comment, but I still don't get the maximum page number of the subpages? What do you mean by replacing the for loop with a list comprehension? Writing everything in one nested line?

By the way @Nadine, I am not 100% sure the indentation is correct, because you still have not fixed it in the question.

@DmitriyFialkovskiy What do you mean by indentation? The way the different lines of code are arranged? Honestly, so far this has mostly been about fixing error messages (they disappeared when I removed or added whitespace). I am still learning with the help of tutorials, but I think this code is quite advanced for me (I discover new things to consider every day).
urls = []
for i in range (1,last_page):
    page = main_page.format(i)
    r = requests.get(page) #these 2 rows added
    soup = BeautifulSoup(r.text, 'lxml') #these 2 rows added
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
        try:
            urls.append(base_url + link.find('a').get('href')+"/-/p/{}")
        except:
            print('no link available', i)
print(urls)
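As an aside, the list comprehension suggested in the comments above could look roughly like this for the loop just shown. This is only a sketch: it assumes the same imports and variables (requests, BeautifulSoup, base_url, main_page, last_page) as before, and it drops the try/except, so a hotel-reviews-bar div without an <a> tag would raise an AttributeError.

urls = [
    base_url + link.find('a').get('href') + "/-/p/{}"
    for i in range(1, last_page)
    for link in BeautifulSoup(requests.get(main_page.format(i)).text,
                              'lxml').find_all('div', {'class': 'hotel-reviews-bar'})
]
urls = list(set(urls))  # drop duplicates, as in the loop version

Whether this is clearer than the explicit for loop is a matter of taste; the loop version keeps the per-link error handling, which is probably worth more here.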