
Python: looping over a list to get the results of each page (subpage)

Tags: python, loops, for-loop, web-scraping, beautifulsoup

I am trying to get the number of pages for each URL from a list of URLs. My code works as long as I only have a single URL, but as soon as I try it with a list of URLs I only get the result for one URL and not for the rest. I guess the problem has to do with my loop. Given that I am still new to Python and BeautifulSoup, I have not been able to spot the mistake myself.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.holidaycheck.de'
main_page = 'https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1?p={}'
urls=[]

##Change URL into object (soup)
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html5lib")

#get max page number
soup = BeautifulSoup(r.text, 'lxml')
data = soup.find_all('a', {'class':'link'})

res = []
for i in data:
    res.append(i.text) #writing each value to res list

res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except:
        print("current value is not a number")
last_page=max(res_int)
#print(last_page)


for i in range (1,last_page):
    page = main_page.format(i)
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
       urls = base_url + link.find('a').get('href')+"/-/p/{}"
       print(urls)
Up to this point everything works: I get the maximum page number and I collect the URLs of all pages. The problem lies in the code below (I believe):

I am trying to apply the same concept as in the code above to each URL in the list urls, but instead of getting the maximum page number for each of the 241 URLs, it seems as if I am stuck in a loop.

Any ideas? Your help is very much appreciated.

You are setting urls equal to the last link generated by the loop. To build a valid list of URLs, you need to replace the = assignment with append().
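In other words, the inner loop from the question would collect the links like this (a minimal sketch reusing the names from the question's code):

urls = []  # start from an empty list
for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
    # append each generated link instead of overwriting urls
    urls.append(base_url + link.find('a').get('href') + "/-/p/{}")

With the original = assignment, urls ends up holding a single string, so a later for url in urls: iterates over its characters rather than over links, which matches the "stuck in a loop" symptom.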

EDIT: OK, as far as I can see there are several problems in your code. Besides my initial fix, I am outlining below my view and understanding of how your code is supposed to work:

import requests
from bs4 import BeautifulSoup
base_url = 'https://www.holidaycheck.de'
main_page = 'https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1?p={}'

##Change URL into object (soup)
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html5lib")

#get max page number
soup = BeautifulSoup(r.text, 'lxml')
data = soup.find_all('a', {'class':'link'})

res = []
for i in data:
    res.append(i.text) #writing each value to res list

res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except:
        print("current value is not a number")
last_page=max(res_int)
#print(last_page)


urls = []
for i in range (1,last_page):
    page = main_page.format(i)
    r = requests.get(page) #these 2 rows added
    soup = BeautifulSoup(r.text, 'lxml') #these 2 rows added
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
        try: #also adding try-except for escaping broken/unavailable links
            urls.append(base_url + link.find('a').get('href')+"/-/p/{}")
        except:
            print('no link available', i)

urls = list(set(urls)) #check and drop duplicated in links list

for url in urls: #to loop through the list of urls
    try:
        r = requests.get(url.format(0))
        print(url.format(0))
        soup = BeautifulSoup(r.text, 'lxml')
        daten = soup.find_all('a', {'class':'link'})
    except:
        print('broken link')
        continue #skip this url so that daten from a previous page is not reused

    tes = []
    for z in daten:
        tes.append(z.text) #writing each value to res list
#    print(tes)

    tes_int = []
    for z in tes:
        try:
            tes_int.append(int(z))
        except:
            print("current value is not a number")
    try:
        anzahl=max(tes_int)
        print(anzahl)
    except:
        print('maximum cannot be calculated')
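Once anzahl is known for a hotel, the next step would be to build the concrete review-page URLs from the same "/-/p/{}" placeholder. This is only a sketch of that follow-up (not part of the original answer); it assumes the pages are numbered from 0, like the ?p={} parameter used above, and it would sit inside the for url in urls: loop:

for p in range(anzahl):  # may need anzahl + 1, depending on how the site counts pages
    page_url = url.format(p)  # fills the "/-/p/{}" placeholder with the page number
    print(page_url)  # fetch and parse each review page here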

Check the value of urls, and then of url inside the for url in urls: loop, I think you will be surprised. You did not define base_url. Also fixed the indentation in your second code block.

@B I checked the URLs and they look fine. I also randomly opened links from the list and they work. Isn't that the case?

No, you didn't. The variable urls at the start of the loop contains a single URL. See Dmitriy Fialkovskiy's answer.

Don't forget to initialize urls. Better yet, replace the for loop with a list comprehension (a sketch of this appears after the last code block below).

urls is initialized in the question code above, but it makes sense to add it here as well. Added urls.

OK, I fixed the code according to @DmitriyFialkovskiy's comment, but I still don't get the maximum page number of the subpages? What do you mean by replacing the for loop with a list comprehension? Writing everything in one nested line?

By the way @Nadine, I am not 100% sure the indentation is correct, because you still have not fixed it in the question.

@DmitriyFialkovskiy What do you mean by indentation? The way the different lines of code are arranged? Honestly, so far this has mostly been about fixing error messages (they disappeared when I removed or added whitespace). I am still learning with the help of tutorials, but I think this code is quite advanced for me (I discover new things to consider every day).
urls = []
for i in range (1,last_page):
    page = main_page.format(i)
    r = requests.get(page) #these 2 rows added
    soup = BeautifulSoup(r.text, 'lxml') #these 2 rows added
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
        try:
            urls.append(base_url + link.find('a').get('href')+"/-/p/{}")
        except:
            print('no link available', i)
print(urls)
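As an aside, the list comprehension suggested in the comments above could look roughly like this for the loop just shown. This is only a sketch: it assumes the same imports and variables (requests, BeautifulSoup, base_url, main_page, last_page) as before, and it drops the try/except, so a hotel-reviews-bar div without an <a> tag would raise an AttributeError.

urls = [
    base_url + link.find('a').get('href') + "/-/p/{}"
    for i in range(1, last_page)
    for link in BeautifulSoup(requests.get(main_page.format(i)).text,
                              'lxml').find_all('div', {'class': 'hotel-reviews-bar'})
]
urls = list(set(urls))  # drop duplicates, as in the loop version

Whether this is clearer than the explicit for loop is a matter of taste; the loop version keeps the per-link error handling, which is probably worth more here.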